WSC-LLM: Efficient LLM Service and Architecture Co-exploration for Wafer-scale Chips
Abstract
The deployment of large language models (LLMs) imposes significant demands on computing, memory, and communication resources. Wafer-scale technology enables the high-density integration of multiple single-die chips with high-speed Die-to-Die (D2D) ...
Reviews
Review 1
Reviewer: The Guardian
Summary
The authors present WSC-LLM, a co-exploration framework designed to optimize Large Language Model (LLM) serving on wafer-scale chip architectures. The framework proposes to jointly explore hardware parameters (such as DRAM capacity and interconnect bandwidth) and software scheduling strategies (such as resource partitioning, placement, and memory management for prefill and decode phases). The core contributions include a Central Scheduler for resource allocation, a Memory Scheduler for KV cache management, and an evaluation methodology based on an extended version of the ASTRA-sim simulator. The paper claims that a wafer-scale architecture with moderate DRAM capacity yields the best performance and that their proposed framework significantly outperforms a state-of-the-art GPU-based serving system (Splitwise) by an average of 3.12x.
Strengths
-
The paper addresses a timely and significant problem at the intersection of LLM serving and next-generation hardware design. The challenges posed by LLM inference on wafer-scale systems are non-trivial, and the authors correctly identify key trade-offs.
-
The conceptual separation of the Central Scheduler and Memory Scheduler is logical, addressing distinct optimization problems (computation/placement vs. data management) that arise in disaggregated LLM inference.
-
The ablation study presented in Section 5.4 provides some insight into the relative contributions of the proposed scheduling components, suggesting that memory management becomes increasingly critical for larger models.
Weaknesses
The paper's conclusions are built upon a methodological foundation that lacks the rigor required for this venue. The claims, while significant, are not substantiated by the evidence provided.
-
Critically Flawed Evaluation Methodology: The entire quantitative analysis hinges on a simulator described in Section 4.6. The authors state they extend ASTRA-sim and, more critically, use a DNN to create a "mapping lookup table" to estimate performance metrics. The paper provides zero validation for this performance model. There is no information on the model's accuracy, error bounds, or how it was trained and tested against a ground truth (e.g., a cycle-accurate model or hardware measurement). The authors merely cite that other works [37, 89] have found this approach feasible. This is insufficient. Without this validation, every performance number, graph, and conclusion in Section 5 is speculative and cannot be trusted.
-
Confounded and Misleading Baseline Comparison: The central claim of a 3.12x performance improvement (Section 5.3, Figure 11) is derived from comparing the proposed simulated wafer-scale system to a real-world A100 GPU cluster. This is a fundamentally confounded comparison. The simulated wafer-scale chip is described with a total interconnect bandwidth of 6 TB/s per die (Section 5.1.1), more than an order of magnitude higher than the 400 GB/s inter-node bandwidth of the GPU cluster. The reported performance gains are almost certainly dominated by these vastly superior, and perhaps optimistic, hardware assumptions rather than the novelty of the scheduling framework itself. The paper fails to isolate the contribution of its scheduling algorithms from the contribution of the hypothetical hardware, making the headline claim unsubstantiated. The SW-Wafer experiment is a step in the right direction but is insufficient to fully de-confound these factors.
-
Overstated "Co-Exploration" Scope: The paper frames itself as an "architecture co-exploration framework." However, the actual exploration in Section 5.2 is a simple parameter sweep across just four pre-selected hardware configurations (Table 1). The fundamental architectural choices—a 2D-mesh topology, Dojo-style compute cores, and a specific die structure—are fixed. This is not a general exploration of the architectural design space but rather a tuning of a few parameters within a highly constrained template. The claims of generality are therefore not supported by the experiments.
-
Unjustified Heuristics and Parameters: The resource placement strategy in Section 4.2.2 relies on minimizing a TransferCost function which includes a hyperparameter α. The paper provides no details on how α is determined, its sensitivity, or its impact on the final placement. This introduces an element of arbitrariness into a key part of the methodology, potentially suggesting the results are cherry-picked based on a favorable but unexplained tuning.
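To make this concern concrete, a cost function of the kind the paper appears to describe might take the following shape. This is a hypothetical sketch: the two cost terms, the Manhattan-distance metric, and the way α blends them are assumptions for illustration, not the authors' actual formulation.

```python
# Hypothetical sketch of a TransferCost-style objective on a 2D mesh.
# The two cost terms and the role of alpha are assumed for illustration;
# the paper does not specify them, which is exactly the concern raised above.

def manhattan(a, b):
    """Hop count between two dies on a 2D mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def transfer_cost(prefill_dies, decode_dies, alpha=0.5):
    """Blend KV-cache transfer distance (prefill -> decode) with
    intra-instance collective distance, weighted by alpha."""
    # Average hops for moving KV cache from prefill dies to decode dies.
    kv_hops = sum(manhattan(p, d) for p in prefill_dies for d in decode_dies)
    kv_hops /= max(1, len(prefill_dies) * len(decode_dies))

    # Average pairwise hops inside the decode instance (proxy for
    # tensor-parallel collective cost).
    pairs = [(a, b) for i, a in enumerate(decode_dies) for b in decode_dies[i + 1:]]
    tp_hops = sum(manhattan(a, b) for a, b in pairs) / max(1, len(pairs))

    return alpha * kv_hops + (1 - alpha) * tp_hops

if __name__ == "__main__":
    prefill = [(0, 0), (0, 1)]
    decode = [(1, 0), (1, 1)]
    # Sweeping alpha shows how strongly the chosen value can reorder placements.
    for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(alpha, round(transfer_cost(prefill, decode, alpha), 3))
```

Even in this toy form, different values of α can reorder candidate placements, which is why the sensitivity analysis requested below matters.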
Questions to Address In Rebuttal
The authors must address the following critical points to establish the validity of their work:
-
Simulator Validation: Please provide rigorous validation data for the DNN-based performance model described in Section 4.6. What is its prediction accuracy (e.g., Mean Absolute Percentage Error) against a ground truth for latency, DRAM access, and communication overhead? Without this, the results are not credible.
-
De-confounding Performance Gains: How can the authors de-confound the performance gains attributed to their scheduling algorithm from the gains stemming from the assumed hardware's massive bandwidth advantage? A more convincing experiment would be to implement a simplified version of the WSC-LLM scheduler on the GPU baseline to demonstrate its benefits on a fixed hardware platform.
-
Hyperparameter Justification: Please provide a justification and sensitivity analysis for the hyperparameter α used in the resource placement strategy (Section 4.2.2). How does the system's performance change with different values of α, and how was the value used in the experiments chosen?
-
Clarification of Constraints: Algorithm 1 (Section 4.2.1) iterates through "all feasible instance sizes." What are the constraints that determine feasibility? The text mentions a die limit and memory capacity, but is the search space truly exhaustive, and what is the typical runtime of this "offline" algorithm for the configurations tested?
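For reference, the kind of search space question 4 is probing might look like the enumeration below. The constraints (instances must tile the wafer, the TP degree must divide the instance size, weights must fit in per-die DRAM) and all numeric parameters are assumptions inferred from the review's reading of the text, not the paper's Algorithm 1.

```python
# Hypothetical enumeration of "feasible instance sizes" to illustrate the
# scale of the search Algorithm 1 would have to cover. Constraints are
# assumptions inferred from the review text, not the paper's actual rules.

from itertools import product

TOTAL_DIES = 64            # dies available on the wafer (assumed)
DIE_DRAM_GB = 12           # DRAM reachable per die (assumed)
MODEL_WEIGHTS_GB = 140     # e.g., a ~70B-parameter model in FP16 (assumed)
TP_CHOICES = (1, 2, 4, 8)  # tensor-parallel degrees considered (assumed)

def feasible_configs():
    """Yield (instance_size, tp_degree) pairs that fit the constraints."""
    for size, tp in product(range(1, TOTAL_DIES + 1), TP_CHOICES):
        if TOTAL_DIES % size != 0:
            continue                      # instances must tile the wafer
        if size % tp != 0:
            continue                      # TP group must divide the instance
        weights_per_die = MODEL_WEIGHTS_GB / size
        if weights_per_die > DIE_DRAM_GB:
            continue                      # weights alone must fit, leaving
                                          # headroom for KV cache
        yield size, tp

if __name__ == "__main__":
    configs = list(feasible_configs())
    print(f"{len(configs)} feasible (instance_size, TP) pairs")
    for cfg in configs:
        print(cfg)
```

If the real constraint set is similarly small, the "offline" search is cheap; if it is not, the runtime question above becomes material.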
Review 2
WSC-LLM: Efficient LLM Service and Architecture Co-exploration for Wafer-scale Chips
Reviewer: The Synthesizer
Summary
This paper presents WSC-LLM, a novel co-exploration framework designed to optimize both the architecture of wafer-scale chips and the scheduling of Large Language Model (LLM) inference workloads. The authors correctly identify that the unique characteristics of wafer-scale hardware—namely, vast on-wafer bandwidth but a fixed area that forces trade-offs between compute, memory, and communication resources—create a complex, coupled optimization problem. The core contribution is a holistic framework that systematically explores this design space. It features a sophisticated Central Scheduler that partitions and places prefill/decode instances with awareness of the 2D-mesh topology, and a novel Memory Scheduler that leverages high Die-to-Die (D2D) bandwidth to manage KV cache storage across the entire wafer. Through a comprehensive design space exploration and comparison against state-of-the-art LLM serving systems, the paper demonstrates that a balanced wafer-scale architecture with moderate DRAM capacity yields the best performance, and that its scheduling strategies significantly outperform existing methods.
Strengths
The true strength of this paper lies in its ambitious and well-executed synthesis of two traditionally separate domains: computer architecture and distributed systems software.
-
Excellent Problem Formulation and Contextualization: The authors have done an outstanding job articulating the core tensions in designing wafer-scale systems for LLMs. Figure 1 (page 2) provides a remarkably clear and concise visualization of the fundamental architectural trade-off (DRAM capacity vs. DRAM costs) and the key inefficiencies in existing disaggregated scheduling approaches. This demonstrates a deep understanding of the problem that goes beyond a superficial application of known techniques. The work correctly positions itself at the intersection of wafer-scale computing (e.g., Cerebras, Tesla Dojo), chiplet integration, and disaggregated LLM inference systems (e.g., Splitwise).
-
Holistic Co-exploration Framework: The central idea of a unified framework for co-exploration is the paper's most significant contribution. Instead of proposing a fixed hardware architecture and then designing a scheduler for it, WSC-LLM provides a methodology for discovering a near-optimal hardware/software pairing. This is a crucial step forward for the field. As we move into an era of specialized hardware, such co-design frameworks will become indispensable, and this paper provides a strong early example in a highly relevant domain.
-
Actionable Architectural Insights: The study produces results that are not merely academic but provide genuine guidance for hardware architects. The conclusion in Section 5.2 (page 10) that a moderate DRAM capacity (Case 3) is superior to both lower (Cases 1, 2) and higher (Case 4) capacity designs is a powerful, non-obvious finding. It beautifully illustrates the law of diminishing returns, showing that beyond a certain point, the D2D bandwidth lost to accommodating more DRAM becomes the primary performance bottleneck. This is a critical insight for the future design of AI accelerators.
-
Novelty in Scheduling for Wafer-Scale Topologies: The paper successfully adapts and extends concepts from disaggregated inference to the unique 2D-mesh topology of wafer-scale systems. The topology-aware placement strategy (Section 4.2.2, page 6) and the wafer-wide Memory Scheduler (Section 4.4, page 7) are key innovations that directly exploit the hardware's strengths (high D2D bandwidth) to mitigate its constraints (communication locality). The ablation study in Section 5.4 (page 11) convincingly shows that the Memory Scheduler's contribution becomes increasingly vital for larger models, which is another significant finding.
Weaknesses
While the core ideas are strong, the paper could be improved by addressing the implications of some of its methodological abstractions and exploring the boundaries of its proposed solution.
-
Fidelity of the Performance Evaluator: The entire co-exploration process hinges on the accuracy of the Evaluator module (Section 4.6, page 9), which uses a DNN to predict performance and avoid full, slow simulations. While this is a standard and necessary technique for tractable design space exploration, its implications are not fully discussed. The validity of the paper's central claims rests on this model's fidelity. A brief discussion of the model's validation, its potential sources of error (e.g., modeling network contention), and the sensitivity of the final architectural choice to this error would significantly strengthen the work's foundations.
-
Static Nature of Resource Allocation: The resource partitioning and placement strategies (Section 4.2.1 and 4.2.2) are performed offline. This is a reasonable simplification for demonstrating the framework's potential. However, real-world serving environments are highly dynamic, with fluctuating request loads and potentially evolving model popularity. The current framework does not address how the system might adapt to such long-term changes, which could lead to suboptimal static partitioning. Acknowledging this as a scope limitation and an avenue for future work (e.g., dynamic re-partitioning) would provide a more complete picture.
-
Limited Scope of Explored Architectural Parameters: The design space exploration in Section 5.1.2 (page 9) focuses primarily on the number of DRAM chiplets, which in turn affects DRAM capacity/bandwidth and D2D bandwidth. While this is arguably the most critical trade-off, a true co-exploration could extend to other parameters like the size and number of compute cores per die, the on-die NoC bandwidth, or the SRAM capacity per core. The current work serves as an excellent proof-of-concept, but its conclusions are conditioned on the fixed compute die design.
Questions to Address In Rebuttal
-
On the Evaluator's Accuracy: Could the authors provide more detail on the validation of the DNN-based performance model used in the Evaluator (Section 4.6)? Specifically, what is the reported prediction error against a cycle-accurate or full-system simulation for key metrics like latency and throughput, and how does this potential error margin impact the confidence in identifying Case 3 as the definitive optimal architecture?
-
On the Generality of the Optimal Architecture: The conclusion that the balanced Case 3 architecture is optimal is a key result. How sensitive is this finding to the workload characteristics? For example, if faced with a workload composed exclusively of very long-context requests (e.g., summarizing entire books), which are heavily prefill-bound and generate massive KV caches, would the optimal point in the design space shift towards Case 4 (higher DRAM bandwidth/capacity)?
-
On Network Contention from the Memory Scheduler: The Memory Scheduler (Section 4.4) is a powerful concept that leverages the entire wafer's DRAM pool for KV cache. As multiple, distributed decoding instances access these remote KV cache blocks, this could create "hotspots" or significant contention on the 2D-mesh network. Is this cross-instance network traffic and potential congestion fully modeled within the Evaluator? And how does the system arbitrate or manage this contention in practice?
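Question 3 can be made operational with a simple link-load count of the kind sketched below, which routes remote KV-cache reads over a 2D mesh with dimension-ordered (XY) routing and reports a hotspot factor. The mesh size, the placement of decode and KV-holding dies, uniform demand, and XY routing itself are all assumptions for illustration; the question is whether the Evaluator models the traffic at this level of fidelity or finer.

```python
# Illustrative link-load analysis for remote KV-cache reads on a 2D mesh.
# Mesh size, placements, uniform demand, and XY routing are all assumptions;
# the point is only to show how easily hotspots can appear.

from collections import defaultdict

MESH = 8  # 8x8 grid of dies (assumed)

def xy_route(src, dst):
    """Links traversed going X-first, then Y (dimension-ordered routing)."""
    links, (x, y) = [], src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links

def hotspot_factor(decode_dies, kv_dies):
    load = defaultdict(float)
    demand = 1.0 / len(kv_dies)  # each decoder spreads reads over all KV dies
    for d in decode_dies:
        for k in kv_dies:
            for link in xy_route(k, d):
                load[link] += demand
    loads = list(load.values())
    return max(loads) / (sum(loads) / len(loads))

if __name__ == "__main__":
    decode = [(x, y) for x in range(4, 8) for y in range(MESH)]  # right half
    kv = [(x, y) for x in range(0, 2) for y in range(MESH)]      # left columns
    print("max/mean link load:", round(hotspot_factor(decode, kv), 2))
```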
Review 3
Title: WSC-LLM: Efficient LLM Service and Architecture Co-exploration for Wafer-scale Chips
Reviewer: The Innovator
Summary
The authors present WSC-LLM, a framework designed to co-explore hardware architecture parameters and software scheduling strategies for serving Large Language Models (LLMs) on wafer-scale chips. The framework aims to navigate the complex trade-offs between computation, memory capacity, and communication bandwidth inherent in such systems. It proposes strategies for resource partitioning, placement, and memory management tailored to the 2D-mesh topology of wafer-scale systems. The core claim of novelty rests on this co-exploration framework and its constituent algorithms for optimizing disaggregated LLM inference in this specific hardware context.
My analysis concludes that while the paper addresses a timely problem with a comprehensive engineering effort, its foundational novelty is limited. The primary contribution is the application-specific synthesis and refinement of existing concepts from distributed systems, scheduling, and hardware/software co-design, rather than the introduction of a fundamentally new algorithmic or architectural paradigm. The most notable novel component is a greedy algorithm for remote KV cache placement.
Strengths
-
Problem-Specific Algorithmic Contribution: The memory scheduling algorithm for KV cache placement (Algorithm 2, Page 8) is a concrete and novel contribution. While the concept of memory disaggregation is not new, this algorithm provides a specific, topology-aware greedy strategy to utilize distributed idle DRAM for KV cache, which is a key challenge in disaggregated LLM serving (a sketch of what such a distance-aware greedy policy might look like follows this list).
-
Holistic System Integration: The framework's strength lies in its comprehensive integration of multiple optimization layers (architectural DSE, instance partitioning, physical placement, memory management) into a single toolchain. While the concept of a co-design framework is not novel, creating a functional one for this complex domain is a non-trivial engineering achievement.
-
Context-Specific Heuristic: The "decoding-centered" resource placement strategy (Section 4.2.2, Page 6) is a simple and logical heuristic tailored to the producer-consumer dataflow of prefill and decoding phases on a 2D mesh. It is a novel, albeit incremental, heuristic for this specific problem.
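As referenced in the first strength above (Algorithm 2), a distance-aware greedy KV-cache placement could plausibly take the following minimal form. The scoring rule (nearest die with sufficient free DRAM under Manhattan distance) is an illustrative reconstruction, not a transcription of the paper's Algorithm 2.

```python
# A minimal sketch of a topology-aware greedy KV-cache placement, assuming
# Manhattan distance on the 2D mesh and a simple free-capacity table. This is
# an illustrative reconstruction, not the paper's Algorithm 2.

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def place_kv_block(block_gb, home_die, free_dram_gb):
    """Pick the die closest to the requesting instance that can hold the block.

    free_dram_gb: dict mapping die coordinate -> remaining DRAM capacity in GB.
    Returns the chosen die, or None if no die has room.
    """
    candidates = [d for d, free in free_dram_gb.items() if free >= block_gb]
    if not candidates:
        return None
    best = min(candidates, key=lambda d: manhattan(d, home_die))
    free_dram_gb[best] -= block_gb
    return best

if __name__ == "__main__":
    free = {(x, y): 4.0 for x in range(4) for y in range(4)}
    free[(1, 1)] = 0.5  # the requester's own die is nearly full
    chosen = place_kv_block(block_gb=1.0, home_die=(1, 1), free_dram_gb=free)
    print("block placed on die", chosen)  # expect an adjacent die such as (0, 1)
```

Question 3 below asks when such a one-block-at-a-time rule breaks down relative to a global (graph-based or ILP) formulation.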
Weaknesses
-
Limited Conceptual Novelty of the Core Framework: The central claim of a "co-exploration framework" is an overstatement of novelty. The field of hardware/software co-design has long used automated Design Space Exploration (DSE) to find optimal system configurations. WSC-LLM appears to be a well-executed but conceptually standard DSE framework applied to a new and important domain. It does not introduce a new theory or methodology for co-design itself.
-
Heavy Reliance on Assembling Prior Art: The framework's scheduling engine is built upon a foundation of well-established techniques from recent LLM serving literature, which the authors correctly cite. These include:
- Disaggregated Inference: The core idea of separating prefill and decoding resources was popularized by systems like Splitwise [62].
- Continuous Batching: A key technique from systems like Orca [91] and vLLM [45].
- Chunked Prefill: A strategy to handle long prompts, proposed in works like DeepSpeed-FastGen [10].
The novelty in WSC-LLM is the integration of these techniques for a wafer-scale target, not the techniques themselves.
-
Algorithmic Contributions are Largely Heuristic Search: The "Optimal Resource Partition Algorithm" (Algorithm 1, Page 7) is essentially a structured grid search over instance sizes and pre-defined Tensor Parallelism (TP) strategies. While systematic, this is a standard methodology for performance exploration and does not represent a novel algorithmic paradigm for optimization. It formalizes an exhaustive search within a constrained space.
Questions to Address In Rebuttal
-
The term "co-exploration framework" suggests a novel methodology. Can the authors precisely articulate what makes the WSC-LLM framework conceptually different from standard hardware/software Design Space Exploration (DSE) frameworks, beyond its application to the specific domain of wafer-scale LLM inference?
-
The decoding-centered placement strategy (Section 4.2.2) is a key contribution for handling the 2D-mesh topology. This producer-consumer placement problem (prefill instances producing KV cache for decoding consumers) is analogous to problems in NoC design and general parallel task mapping. Can the authors contrast their heuristic with prior work on topology-aware task placement for producer-consumer patterns and clarify the novelty?
-
The KV cache placement algorithm (Algorithm 2) is a greedy strategy. While this is the paper's strongest novel component, its simplicity warrants discussion. Could you elaborate on the conditions under which this greedy approach is sufficient and identify scenarios (e.g., highly fragmented memory, complex communication patterns) where it might lead to suboptimal placements compared to more complex graph-based or ILP formulations?
-
If you had to isolate the single most significant and novel conceptual contribution of this work, what would it be? Is it the framework itself, one of the specific algorithms, or the insight that moderate DRAM is optimal? Please be specific.
LightML: A Photonic Accelerator for Efficient General Purpose Machine Learning
Abstract
The rapid integration of AI technologies into everyday life across sectors such as healthcare, autonomous driving, and smart home applications requires extensive computational resources, placing strain on server infrastructure and incurring significant ...
Reviews
Review 1
Reviewer: The Guardian
Summary
This paper introduces LightML, a photonic co-processor architecture designed for general-purpose machine learning acceleration. The authors claim to present the first complete "system-level" photonic crossbar design, including a dedicated memory and buffer architecture. The core of the accelerator is a photonic crossbar that performs matrix-matrix multiplication (MMM) using coherent light interference. The paper also proposes methods for implementing other necessary ML functions, such as element-wise operations and non-linear activation functions, directly on the photonic hardware. The headline claims are a peak performance of 325 TOP/s at only 3 watts of power and significant latency improvements (up to 4x) over an NVIDIA A100 GPU for certain models.
Strengths
The paper is well-written and addresses a compelling long-term research direction. The core strengths are:
- Detailed Physical-Layer Analysis: The paper provides a thorough and convincing analysis of the physical error sources in the proposed photonic crossbar (Section 3.2, Pages 3-4). The modeling of noise from beam splitters, modulators, phase shifts, and detectors is detailed and grounded in prior work, lending credibility to the feasibility of the core computational unit.
- Clear Articulation of Photonic Advantages: The authors do an excellent job of explaining why photonics is a promising substrate for computation, correctly identifying the fundamental advantages over electronic and resistive-memory-based crossbars, such as the ability to perform true MMM instead of MVM and the circumvention of slow, high-power reprogramming (Section 4, Page 4).
Weaknesses
While the physical foundation is solid, the paper's system-level claims are undermined by significant methodological flaws, questionable assumptions, and a disconnect between the simulated components and the real-world baseline comparisons.
- Unsubstantiated Performance Claims due to Inequitable Comparison: The headline performance claims (e.g., 325 TOP/s, 4x shorter latency than an A100) are built on a foundation of inconsistent and unfair comparisons. LightML's performance is derived from a custom simulator using a 12 GHz clock frequency, while the baseline GPU and TPU measurements are from real hardware running at much lower frequencies (e.g., 765 MHz for the A100) (Table 4, Page 10). Furthermore, LightML is simulated using 5-bit integer precision (Int5), while the GPU/TPU baselines use 16-bit floating-point (FP16). Comparing a specialized, low-precision, high-frequency simulated accelerator to a general-purpose, high-precision, lower-frequency real-world chip is not a valid methodology for claiming superior performance. The reported speedups are more likely an artifact of these disparate parameters than a true architectural advantage.
- Overstated "System-Level" Contribution and Unrealistic Buffer Architecture: The paper claims to be the "first system-level photonic crossbar architecture" with a "novel memory and buffer design" (Abstract, Page 1). However, the memory solution is not novel; it is a standard hierarchy of SRAM buffers whose parameters are estimated using CACTI. There is no evidence of a co-designed memory architecture that fundamentally addresses the unique data-delivery challenges of a photonic core. The claim that this buffer design achieves "over 80% utilization" (Abstract, Page 1 and Figure 13, Page 11) is not adequately supported. Figure 13 shows utilization for specific models but does not demonstrate that this holds for a general workload. It is highly likely that for many operations, particularly those that are not dense matrix multiplications, the photonic core would be severely data-starved by a conventional SRAM buffer, making the high utilization claim suspect.
- Impractical Implementation of Non-Linear Functions: The proposal for implementing non-linear functions (e.g., Sigmoid, tanh) via Fourier Series using the existing optical modulators is theoretically interesting but practically flawed (Section 6.2, Page 6; Table 5, Page 11). The authors claim this requires "no extra hardware" and is therefore highly efficient. This is misleading. This method requires two full passes through the crossbar and multiple ADC readouts, which introduces significant latency and energy overhead that is not properly accounted for in the performance analysis. Claiming it has no area or power overhead (Table 5, Page 11) is incorrect because it consumes the primary compute resource for multiple cycles. This approach is far less efficient than a small, dedicated digital logic unit, as is standard in electronic accelerators.
- Contradictory and Insufficient Precision Analysis: The paper makes conflicting claims about precision. It targets a conservative 5-bit precision (4-bit magnitude, 1-bit sign) for its primary design (Section 3.3, Page 4) and claims this results in a minimal 1-2% accuracy drop (Table 6). However, the error modeling itself shows that to achieve 5-bit precision, fabrication non-idealities must be controlled to an extremely high degree (e.g., $\delta_{k}^{2}<0.5/2^{5}$) (Section 3.3, Page 4). The paper hand-waves this away by appealing to "SOTA technology" and "one-time calibration" without providing sufficient evidence that such calibration is feasible or sufficient for a 128x128 array operating at 12 GHz. The claim that the accuracy drop is only 1-2% is based on injecting Gaussian noise during inference (Section 3.2, Page 4), which may not accurately model the complex, correlated noise profiles of a real photonic system.
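The last point can be illustrated with the kind of experiment the paper appears to rely on: quantize operands to 5 bits (sign plus 4-bit magnitude) and inject i.i.d. Gaussian error into the outputs. The sketch below is an assumption-laden stand-in for that methodology; its purpose is to show why i.i.d. noise is a forgiving model, since independent errors largely average out across 128-term dot products, whereas correlated per-device error would not.

```python
# Illustrative check of what Int5 (sign + 4-bit magnitude) plus i.i.d. Gaussian
# noise does to a matrix product. This mirrors the paper's apparent methodology
# only loosely; parameter choices here are assumptions.

import numpy as np

rng = np.random.default_rng(0)

def quantize_int5(x):
    """Symmetric 5-bit quantization: 4 magnitude bits plus a sign bit."""
    scale = np.max(np.abs(x)) / 15.0  # 4-bit magnitude -> levels 0..15
    return np.round(x / scale) * scale

def noisy_matmul(a, b, sigma_rel=0.5 / 2**5):
    """Quantize operands, multiply, and add Gaussian error per output element."""
    out = quantize_int5(a) @ quantize_int5(b)
    noise = rng.normal(0.0, sigma_rel * np.std(out), size=out.shape)
    return out + noise

if __name__ == "__main__":
    a = rng.standard_normal((128, 128))
    b = rng.standard_normal((128, 128))
    exact = a @ b
    approx = noisy_matmul(a, b)
    rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
    print(f"relative error: {rel_err:.3%}")
    # i.i.d. noise like this averages out over the 128-term dot products;
    # correlated, per-device error (the realistic case) would not.
```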
Questions to Address In Rebuttal
- Please justify your direct performance comparison to the NVIDIA A100. Provide a new comparison where the baseline is either simulated with the same high clock frequency and low precision as LightML, or where LightML is simulated with the clock speed and precision (FP16) of the A100.
- Your claim of ">80% utilization" is based on a few selected models. Provide a more rigorous analysis of the memory and buffer architecture. What is the utilization for less ideal workloads, such as sparse matrix operations or models with many element-wise additions, and how does your "novel" buffer design specifically address the data-delivery challenges of a 12 GHz photonic core?
- Please provide a detailed cycle- and energy-cost analysis for your proposed non-linear function implementation. How many cycles does it take to compute a single Sigmoid activation for a vector of 128 elements, and what is the total energy consumption, including all data movement and ADC conversions? How does this compare to a standard digital implementation?
- The paper states that achieving the required precision depends on post-fabrication calibration (Section 3.3, Page 4). Can you provide evidence from prior work demonstrating that such per-unit-cell calibration is viable for a large (128x128) array and remains stable across varying temperatures and operating conditions at 12 GHz? Without this, the claimed precision is purely theoretical.
Review 2
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper proposes LightML, an architecture for a photonic co-processor aimed at general-purpose machine learning acceleration. The central contribution is an attempt to define a complete system-level architecture around a photonic crossbar, which performs matrix-matrix multiplication (MMM) in the optical domain. The authors move beyond the core optical computation—which has been demonstrated in prior work—to address the surrounding necessary components, including a memory and buffering hierarchy and methods for implementing non-linear activation functions. The work positions itself as a bridge between foundational physics-level research and practical, high-performance ML accelerators, claiming significant performance and power efficiency gains over conventional electronic hardware like GPUs.
Strengths
This paper represents a valuable and necessary step in the evolution of photonic computing for AI. Its primary strength is its holistic, system-level perspective, which moves the conversation forward from isolated device physics to architectural reality.
-
Tackling the "System Problem": The field of photonic ML acceleration is maturing. Foundational work, such as Shen et al. (2017), demonstrated the feasibility of on-chip optical neural networks using Mach-Zehnder Interferometer (MZI) arrays. However, much of the subsequent research has focused on optimizing the core photonic devices. This paper correctly identifies that the next major hurdle is the "system problem": how do you feed data to, and get data from, these incredibly fast cores efficiently? By proposing and modeling a complete data path, including an SRAM buffer hierarchy (Section 5, Page 5), the paper connects the photonic core to the rest of the computing world. This is a crucial and often-overlooked aspect of analog and non-traditional accelerator design.
-
Pragmatic Approach to General-Purpose Operations: The paper thoughtfully considers how to implement the full suite of operations required for modern ML models, not just matrix multiplication. The proposed methods for element-wise operations (Section 6.1, Page 6) and, more ambitiously, non-linear functions using Fourier Series decomposition (Section 6.2, Page 6), show a commitment to creating a truly general-purpose accelerator. While the efficiency of these specific methods can be debated, the attempt to solve these problems within the photonic domain is a significant contribution that pushes the field to think beyond simple linear algebra.
-
Connects to Broader Analog Computing Trends: This work fits squarely within the broader landscape of research into analog and in-memory computing accelerators. Like resistive RAM (ReRAM) crossbars, LightML leverages a physical property (light interference) to perform computation in-place, promising massive energy savings by avoiding data movement. By detailing the challenges of noise, precision, and the analog-to-digital interface (Section 3, Page 3), this paper contributes to the shared knowledge base of this entire research thrust, and its findings will be relevant even to those working on non-photonic analog systems.
Weaknesses
While the system-level ambition is a strength, the paper's current weaknesses stem from a disconnect between its theoretical potential and the practical realities of both hardware implementation and software integration.
-
The "Memory Wall" Reappears in a New Form: The paper proposes a standard SRAM buffer hierarchy to feed its 12 GHz photonic core. However, the analysis does not fully grapple with the immense data-delivery challenge this creates. The core can consume data far faster than conventional SRAM, even in a stacked configuration, can provide it. This creates a new kind of "memory wall," where the incredible speed of the photonic computation is likely to be bottlenecked by the electronic memory system. The high utilization figures reported (Figure 13, Page 11) are for specific, dense models and may not reflect the reality of more diverse or sparse workloads, where data starvation could become a dominant issue. A rough estimate of the required operand bandwidth is sketched after this list.
-
The Software/Compiler Challenge is Understated: The paper focuses on the hardware architecture but does not deeply address the software and compiler stack required to make such an accelerator usable. Mapping complex modern models (like Transformers) onto this architecture, managing the limited precision, and deciding when to use the inefficient on-chip non-linear functions versus offloading to a digital co-processor are all non-trivial compiler problems. Without a clear path to a programmable software stack, LightML remains more of a specialized engine than a "general purpose" accelerator.
-
Positioning Relative to Commercial Efforts: The academic landscape is not the only context. Companies like Lightmatter are already building and shipping silicon photonics products for AI. While their architecture is different (focusing more on interconnects), a discussion of how LightML's "all-in-one" compute-and-memory approach compares to these commercial strategies would provide valuable context and highlight the paper's unique contributions more clearly.
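As referenced in the first weakness above, the size of the data-delivery gap can be gauged with simple arithmetic. The sketch assumes the crossbar consumes two fresh 128-element, 5-bit operand vectors per cycle at the claimed 12 GHz; that dataflow assumption and the SRAM-bank bandwidth ballpark are not taken from the paper.

```python
# Back-of-envelope for feeding a 128-wide photonic crossbar at 12 GHz with
# 5-bit operands. All parameters are assumptions for illustration; the paper's
# exact dataflow may differ, but the order of magnitude is the point.

CLOCK_HZ = 12e9          # claimed photonic clock (from the paper)
VECTOR_WIDTH = 128       # crossbar dimension (from the paper)
BITS_PER_ELEMENT = 5     # Int5 operands (from the paper)
OPERANDS_PER_CYCLE = 2   # assume two fresh input vectors consumed per cycle

input_bw = OPERANDS_PER_CYCLE * VECTOR_WIDTH * BITS_PER_ELEMENT * CLOCK_HZ / 8
print(f"required input bandwidth: {input_bw / 1e12:.2f} TB/s")

# For comparison, a single on-chip SRAM bank delivering a 64-byte line at
# 2 GHz provides roughly 0.128 TB/s (assumed ballpark).
sram_bank_bw = 64 * 2e9
print(f"SRAM banks needed just for inputs: {input_bw / sram_bank_bw:.0f}")
```

Sustaining roughly 2 TB/s of operand traffic, before counting outputs and ADC readout, is the concrete form of the "new memory wall" described above.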
Questions to Address In Rebuttal
-
Your work takes an important step towards a full system design. Could you elaborate on the long-term vision for the memory architecture? How can the data delivery gap between the electronic SRAM and the photonic core be sustainably bridged as both technologies scale?
-
Thinking about the software stack, how would a compiler decide to map operations onto LightML? For instance, for a given model, how would it weigh the cost of using the on-chip Fourier-based ReLU against the latency and energy cost of sending the data off-chip to a digital unit?
-
The paper focuses on a "co-processor" model. Can you discuss the future integration of LightML with a host CPU? What would be the ideal interface (e.g., PCIe, CXL), and what new challenges would arise in a tightly-coupled heterogeneous system?
-
How do you see the trade-offs made in LightML (e.g., low precision, complex non-linearities) influencing the design of future machine learning models? Could this architecture drive the development of new, "photonics-friendly" model architectures that are inherently more robust to analog noise and rely more on efficient linear operations?
Review 3
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present LightML, an architecture for a photonic co-processor. The paper's primary claims to novelty are: 1) Being the "first system-level photonic crossbar architecture" by including a co-designed memory and buffer system (Abstract, Page 1), and 2) A novel method for implementing non-linear activation functions directly in the optical domain by approximating them with a Fourier Series, which they claim requires "no extra hardware" (Abstract, Page 1; Section 6.2, Page 6). The architecture is based on a photonic crossbar that performs matrix-matrix multiplication (MMM) using coherent light interference in an array of Mach-Zehnder interferometers.
Strengths
From a novelty standpoint, the paper's strength lies in its attempt to synthesize existing concepts into a more complete system, rather than inventing entirely new foundational components.
- System-Level Integration as a Novelty Claim: The primary novel contribution is the argument for a "system-level" design. While individual components have been demonstrated before, this work is among the first in the academic literature to explicitly model and present an end-to-end architecture that includes the photonic core, ADCs/DACs, and a multi-level SRAM buffer hierarchy (Section 5, Page 5). The "delta" here is the integration itself—proposing a complete blueprint where prior work often focused on demonstrating a single component in isolation.
Weaknesses
While the system-level claim is a step forward, the novelty of the underlying components and techniques is highly questionable when examined against prior art.
-
Core Computational Method is Well-Established: The fundamental concept of using a mesh of Mach-Zehnder interferometers (MZIs) to perform matrix multiplication is not new. This was famously demonstrated by Shen et al., Nature Photonics (2017), which laid the groundwork for this entire line of research. The use of homodyne detection for optical MAC operations is also a known technique. LightML uses this established foundation (Section 2, Page 2), and while the specific array size and configuration are implementation details, the core computational principle is not a novel contribution of this paper.
-
"Novel" Memory Architecture is Standard Practice: The paper claims a "novel memory and buffer design" (Abstract, Page 1). However, the described architecture is a standard two-level hierarchy consisting of off-chip HBM and on-chip SRAM buffers (Section 5, Page 5). This is a conventional memory architecture used in virtually all modern digital accelerators (GPUs, TPUs). There is no novel mechanism presented that is uniquely tailored to the challenges of photonic data delivery. The use of double-buffering (Section 5.2, Page 6) is a standard technique to hide latency and is not a novel invention of this work. Therefore, the claim of a novel memory architecture is unsubstantiated.
-
Non-Linear Function Implementation is Not Novel and Impractical: The paper proposes using the Fourier Series to implement non-linear functions by leveraging the inherent sine/cosine response of the phase modulators (Section 6.2, Page 6). This idea is not entirely new; the general concept of function approximation via basis functions is a staple of signal processing. More importantly, the claim that this requires "no extra hardware" is misleading. The implementation described requires two full passes through the crossbar and multiple ADC conversions: one pass to calculate the intermediate products of the input with the Fourier frequencies, and a second pass to multiply those results by the Fourier coefficients. This consumes the primary, most powerful computational resource in the entire accelerator to perform what is typically a simple, low-cost operation in digital logic. This is not a novel, efficient solution but rather a highly inefficient re-purposing of existing hardware that incurs significant latency and energy overhead, which is not adequately accounted for in the analysis. The two-pass structure is illustrated in the sketch after this list.
-
Incremental Advance, Not a Breakthrough: When viewed in the context of the field, LightML appears to be an incremental, academic synthesis of pre-existing ideas. The core computation is from Shen et al., the concept of system-level photonic integration is being actively pursued commercially by companies like Lightmatter, and the proposed "novel" solutions for memory and non-linearities are either standard digital designs or inefficient adaptations. The performance gains reported are largely a product of comparing a highly-specialized, low-precision (5-bit) simulated design (Section 3.3, Page 4) running at an optimistic 12 GHz to general-purpose, high-precision (FP16) real-world hardware running at much lower clock speeds (Table 4, Page 10), which is not a valid basis for claiming a novel performance breakthrough.
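As referenced in the non-linearity weakness above, the two-stage structure of a Fourier-series activation can be sketched as follows: stage 1 evaluates the sine/cosine basis of the (scaled) inputs, the role the phase modulators would play, and stage 2 combines the basis with fixed coefficients, a second small matrix pass. The interval, harmonic count, and target function below are assumptions for illustration.

```python
# Truncated Fourier-series approximation of a sigmoid on [-L, L], to illustrate
# the two-stage structure the review describes: stage 1 evaluates sin/cos of
# the (scaled) inputs, stage 2 is a small matrix multiply by fixed coefficients.
# Interval and harmonic count are assumptions for illustration.

import numpy as np

L, N_HARMONICS = 8.0, 8

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Stage 0 (offline): Fourier coefficients of sigmoid on [-L, L] via a simple
# Riemann sum over a dense grid.
grid = np.linspace(-L, L, 4001)
dx = grid[1] - grid[0]
f = sigmoid(grid)
a0 = np.sum(f) * dx / L
a = [np.sum(f * np.cos(n * np.pi * grid / L)) * dx / L for n in range(1, N_HARMONICS + 1)]
b = [np.sum(f * np.sin(n * np.pi * grid / L)) * dx / L for n in range(1, N_HARMONICS + 1)]

def fourier_sigmoid(x):
    # Stage 1: evaluate the harmonic basis of the inputs (modulator-like step).
    basis = [np.cos(n * np.pi * x / L) for n in range(1, N_HARMONICS + 1)]
    basis += [np.sin(n * np.pi * x / L) for n in range(1, N_HARMONICS + 1)]
    # Stage 2: combine with fixed coefficients (a second, small matrix pass).
    coeffs = np.array(a + b)
    return a0 / 2 + coeffs @ np.array(basis)

x = np.linspace(-6, 6, 1000)
err = np.max(np.abs(fourier_sigmoid(x) - sigmoid(x)))
print(f"{N_HARMONICS} harmonics -> max abs error {err:.4f} on [-6, 6]")
```

Every activation therefore requires the full basis to be produced and read out before the coefficient pass, which is the two-crossbar-pass, multiple-ADC-readout overhead the review argues is unaccounted for.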
Questions to Address In Rebuttal
- Please clarify the novelty of your memory architecture (Section 5, Page 5). Given that it is a standard HBM + SRAM hierarchy, what specific mechanism or design choice is fundamentally new and not considered standard practice in digital accelerator design?
- Can you defend the claim that your Fourier-based non-linear function implementation (Section 6.2, Page 6) has "no extra hardware overhead"? Please provide a detailed breakdown of the cycle count and energy consumption for this operation, including all data movement to and from the crossbar and all ADC/DAC conversions, and compare it to the cost of a small, dedicated digital ALU for the same function.
- The foundational work of Shen et al. (2017) demonstrated optical matrix multiplication with MZI arrays. What is the fundamental, conceptual "delta" of the LightML computational core (Section 2, Page 2) that you would consider a significant advance over this and other subsequent works that have used the same principle?
- How does your proposed "system-level" architecture differ from the integrated photonic systems being developed commercially by companies like Lightmatter? What is the unique architectural insight in LightML that is not already being pursued in industry?
FRED: A Wafer-scale Fabric for 3D Parallel DNN Training
Abstract
Wafer-scale systems are an emerging technology that tightly integrates high-end accelerator chiplets with high-speed wafer-scale interconnects, enabling low-latency and high-bandwidth connectivity. This makes them a promising platform for deep neural ...
Reviews
Review 1
Reviewer: The Guardian
Summary
This paper identifies the communication inefficiencies of standard 2D Mesh interconnects for supporting the complex, multi-dimensional parallelism strategies (e.g., 3D parallelism) used in large-scale DNN training. To address this, the authors propose FRED, a wafer-scale fabric architecture. FRED's design is a hierarchical, switch-based topology (specifically, a fat-tree) composed of small, recursive "FRED Switches" that feature in-network collective support. The authors claim that by providing flexible, non-blocking connectivity, FRED significantly improves end-to-end training time—by up to 1.87x—for large models compared to a conventional 2D Mesh on a wafer-scale substrate.
Strengths
The paper is well-motivated and attacks a clear and relevant problem in the design of next-generation training hardware.
- Strong Problem Motivation: The paper does an excellent job of systematically breaking down the communication challenges of a 2D Mesh when faced with 3D parallelism (Section 3.2, Pages 4-5). The analysis of issues like I/O hotspotting (Figure 4, Page 5), the mathematical impossibility of optimally mapping 3D logical groups onto a 2D physical grid (Figure 5, Page 5), and the difficulties with non-aligned parallelism strategies (Figure 6, Page 6) is thorough and convincing. This provides a solid foundation for the necessity of a more flexible fabric.
- Sound Architectural Concept: The high-level architectural choice of a hierarchical, switched fabric (specifically, an almost fat-tree) is a logical and well-reasoned solution to the problems identified with the 2D Mesh. A switched topology inherently provides greater path diversity and higher bisection bandwidth, which directly addresses the congestion and mapping limitations of a mesh.
Weaknesses
Despite the clear motivation, the paper's central claims are undermined by a flawed evaluation methodology, questionable baseline comparisons, and an oversimplification of critical system-level challenges.
- Unsubstantiated Performance Claims due to Unrealistic Baseline: The headline performance improvements are built on a comparison between the proposed FRED-D architecture and a baseline 2D Mesh that is not equitable. FRED-D is simulated with a bisection bandwidth of 30 TBps, whereas the baseline Mesh is limited to 3.75 TBps (Table 5, Page 10). The reported speedups are therefore less a testament to FRED's architectural novelty and more a predictable outcome of providing the network with 8x more raw bandwidth. The paper attempts an apples-to-apples comparison with FRED-A (which has the same 3.75 TBps bisection bandwidth), but this configuration shows minimal to no performance benefit, thus invalidating the headline claims. The improvements come from bandwidth, not from the specific FRED architecture itself.
- Simulation Fidelity is Questionable: The entire evaluation rests on the ASTRA-SIM framework (Section 7.4, Page 10). While ASTRA-SIM is a useful tool for high-level performance projection, the paper provides no evidence that it accurately models the detailed, low-level network dynamics of a wafer-scale fabric. For example, there is no discussion of how the simulation models the latency of the very long wafer-scale links, the complex routing decisions within the FRED microswitches, or potential physical-layer effects at these high bandwidths. Without calibration against a more detailed network simulator (like Garnet or a commercial tool) or real hardware, the quantitative results lack credibility.
- Physical Implementation Challenges are Understated: The paper acknowledges that the required switch chiplets would be large but dismisses the area overhead by appealing to "unclaimed area on the wafer" and future I/O technologies (Section 6.2.3, Pages 9-10). This is a significant oversimplification. The proposed design requires a massive number of long, high-bandwidth, wafer-scale interconnects between the NPU chiplets and the L1 FRED switches, and then again between the L1 and L2 switches (Figure 8, Page 8). The routing complexity, wiring density, and potential for signal integrity issues in such a design are immense and are not adequately addressed. Claiming this can be implemented by simply using "unclaimed area" ignores the profound physical design challenges.
- In-Network Compute is Not a Free Lunch: A key feature of FRED is its in-network collective execution. However, the paper provides insufficient detail on the hardware cost and complexity of the "R-uSwitch" and "D-uSwitch" components that perform these reductions (Figure 7, Page 7). These operations, especially on floating-point data, are not trivial. The paper presents area and power numbers post-layout (Table 4, Page 9), but does not provide a breakdown of how much of that is attributable to the in-network compute logic versus the standard switching and buffering. This makes it impossible to evaluate the true cost-benefit trade-off of this feature.
Questions to Address In Rebuttal
- Please justify your primary comparison between FRED-D (30 TBps bisection) and the baseline Mesh (3.75 TBps bisection). To isolate the architectural benefits of FRED, please provide a direct comparison of end-to-end training time between the baseline Mesh and a Mesh that is also provisioned with an equivalent 30 TBps bisection bandwidth.
- Can you provide more detail on the validation of your ASTRA-SIM model? Specifically, how did you model the latency and energy of the multi-centimeter wafer-scale links, and how were the internal micro-architectural delays of the FRED switches determined and validated?
- The proposed physical layout in Figure 8 requires an extremely complex and dense wiring scheme. Have you performed a routability analysis for this design? What is the estimated total wire length on the wafer, and what are the associated energy costs for driving signals across these long wires, which your current power analysis seems to aggregate into a single "Additional Wafer-Scale Wiring" number? (A rough order-of-magnitude estimate is sketched after this list.)
- Please provide a detailed microarchitectural description of the reduction/distribution units (R-uSwitch/D-uSwitch). What is the area and power cost specifically for the FP16 arithmetic logic within these switches, separate from the standard switching logic? How does this design handle potential floating-point exceptions or the need for different rounding modes during in-network reduction?
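As referenced in question 3, a rough order-of-magnitude estimate of the wiring energy can be formed as follows. Both the per-bit-per-millimetre energy and the average wire length are placeholder assumptions (neither appears in the paper), and low-swing or equalized signalling could reduce the result considerably; the point is only that the "Additional Wafer-Scale Wiring" line item deserves a breakdown.

```python
# Rough, assumption-laden estimate of the power needed to drive FRED's
# wafer-scale wiring. Both the energy-per-bit-per-mm figure and the average
# wire length are placeholders; the paper reports neither.

BISECTION_BW_TBPS = 30          # FRED-D bisection bandwidth (from the paper)
AVG_WIRE_LENGTH_MM = 25         # assumed average NPU<->switch wire length
ENERGY_PJ_PER_BIT_MM = 0.1      # assumed full-swing on-wafer wire energy

bits_per_second = BISECTION_BW_TBPS * 1e12 * 8
watts = bits_per_second * AVG_WIRE_LENGTH_MM * ENERGY_PJ_PER_BIT_MM * 1e-12
print(f"wiring power for bisection traffic alone: ~{watts:.0f} W")
```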
Review 2
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper identifies a critical architectural challenge in large-scale DNN training: the mismatch between the communication patterns of advanced 3D parallelism (Data, Pipeline, and Tensor) and the rigid topology of a conventional 2D Mesh interconnect. The authors propose FRED, a hierarchical, switch-based wafer-scale fabric designed to provide the flexible, high-bandwidth connectivity required by these complex training strategies. FRED is built from small, recursive switch units ("FRED Switches") that support in-network collective operations and are arranged in a fat-tree-like topology. The work's central thesis is that moving beyond the simple 2D Mesh to a more sophisticated fabric is essential for unlocking the full potential of wafer-scale training hardware for next-generation AI models.
Strengths
This paper is a valuable contribution to the field because it thoughtfully connects several key trends in high-performance computing and applies them to the specific domain of wafer-scale AI accelerators.
-
Excellent Problem Synthesis: The paper does a superb job of synthesizing the state of the art in both software (3D parallelism) and hardware (wafer-scale integration) to identify a crucial point of friction. It correctly observes that as training software becomes more sophisticated, the underlying hardware interconnect, which has historically been a simple mesh for both academic proposals and commercial systems like Google's TPUs, becomes a major bottleneck. This clear and timely problem statement (Section 3, Pages 3-6) is the paper's greatest strength.
-
Leveraging Established HPC Principles: The proposed solution, a switched fat-tree network, is not a radical invention but a well-reasoned application of established principles from the High-Performance Computing (HPC) world. Fat-tree topologies have long been the standard for large-scale supercomputers precisely because they provide high bisection bandwidth and flexible, non-blocking communication. By adapting this concept to a wafer-scale implementation, the paper bridges the gap between the on-chip (NoC) and data-center-scale networking worlds, creating a compelling vision for a "data center on a wafer." This is a logical and powerful architectural evolution.
-
Connecting to Commercial Realities: The FRED architecture, in concept, closely mirrors the direction the industry is heading. NVIDIA's DGX systems, for example, do not connect GPUs in a simple mesh but use dedicated, high-radix NVSwitch chips to create a non-blocking fabric for all-to-all communication. FRED can be seen as the wafer-scale analogue of an NVSwitch-based system. By proposing an on-wafer fabric with in-network collectives, this work provides a forward-looking academic blueprint that aligns with the architectural principles of today's most powerful commercial training systems.
Weaknesses
While the high-level vision is strong, the paper could do more to situate its specific design within the rich landscape of existing work and address the immense practical challenges of its proposal.
-
Insufficient Engagement with Prior Wafer-Scale Work: The paper positions itself as a novel solution for wafer-scale systems but does not deeply engage with the design choices of existing wafer-scale pioneers, most notably Cerebras. The Cerebras architecture uses a 2D mesh interconnect, but pairs it with a unique "weight streaming" execution model that differs significantly from traditional GPU-style training. A more thorough analysis would compare FRED not just to a generic mesh, but would discuss why the Cerebras approach is insufficient for 3D parallelism and how FRED's flexibility enables a broader range of training paradigms that Cerebras may not support well.
-
The "How" of Physical Implementation: The paper focuses on the logical topology of FRED but is light on the details of the physical implementation, which is a major challenge for wafer-scale systems. Building a multi-level fat-tree on a 2D wafer requires extremely long and dense global wiring, which introduces significant latency, power, and signal integrity issues. While the paper acknowledges this (Section 6.2.3, Page 9), it would be strengthened by a more detailed discussion of the physical design challenges and a comparison to alternative physical topologies, such as folded torus or dragonfly networks, which are also used in HPC to reduce wiring length.
-
Under-explored Design Space for In-Network Compute: The idea of in-network collectives is powerful. However, the paper presents a single data point (a specific "uSwitch" design). The design space is much richer. For example, what is the trade-off between having more, simpler switches versus fewer, more complex switches with more powerful reduction units? How does the choice of number format (e.g., FP16 vs. BFloat16 vs. block floating point) for the in-network math affect the area, power, and accuracy of the collectives? Exploring these trade-offs would provide a more comprehensive guide for future architects.
Questions to Address In Rebuttal
-
Could you elaborate on the comparison between FRED and an NVSwitch-based system? What are the unique challenges and opportunities of implementing a switched fabric on a monolithic wafer compared to connecting discrete GPUs with external switch chips?
-
Your work rightly criticizes the 2D Mesh. However, Google's TPU pods have successfully used a 2D/3D Torus interconnect for years. What specific limitations of the Torus topology, beyond those of a simple Mesh, does FRED address that justifies the significant increase in architectural complexity?
-
The paper focuses on a fat-tree. Could you discuss why you chose this topology over other high-bandwidth HPC topologies like Dragonfly or a hypercube? What are the relative advantages and disadvantages of these alternatives in the context of a 2D wafer implementation?
-
Looking forward, how do you see a fabric like FRED enabling future DNN training paradigms beyond the current 3D parallelism model? Could the flexible, low-latency connectivity be exploited for more dynamic, fine-grained, or irregular parallelism strategies?
Review 3
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors identify the inefficiency of 2D Mesh interconnects for 3D DNN parallelism and propose FRED, a wafer-scale interconnect fabric. The core novel claim is the architectural synthesis of applying a hierarchical, switched fat-tree topology, complete with support for in-network collective computations, to the specific physical substrate of a monolithic wafer for the purpose of accelerating DNN training (Abstract, Page 1; Section 1, Page 2). The proposed "FRED Switch" is a recursive micro-switch design intended to be the building block of this fabric (Section 4, Page 6).
Strengths
From a novelty perspective, the contribution of this paper is not in the invention of new foundational primitives, but in the novel adaptation and synthesis of existing, powerful concepts into a new domain.
- Novel Architectural Shift for Wafer-Scale Systems: The primary "delta" of this work is moving the architectural conversation for wafer-scale interconnects beyond the simple 2D Mesh/Torus topologies that have dominated both academic proposals and commercial systems like the Google TPU and Cerebras WSE. By proposing a hierarchical, switched fat-tree (Section 4, Page 6), the paper introduces a fundamentally different and more flexible network paradigm to this domain. This application of a well-understood HPC topology to the unique constraints of a wafer is a significant and novel conceptual step.
Weaknesses
While the synthesis is novel, the work's claims are diluted because the foundational building blocks are well-established prior art. The paper does not invent new concepts so much as it re-packages existing ones.
- Core Topology is Not New: A fat-tree network is a canonical topology in high-performance computing (HPC) and data center networking, and its benefits for providing high bisection bandwidth have been known for decades. The novelty here is purely in the application to a wafer, not in the invention of the topology itself.
- In-Network Compute is Prior Art: The idea of offloading collective operations (like reductions) into the network fabric is a well-established technique in the HPC community, commercialized in technologies like NVIDIA's SHARP (Scalable Hierarchical Aggregation and Reduction Protocol). The paper acknowledges this prior art (Section 2.2, Page 3). The "R-uSwitch" and "D-uSwitch" (Figure 7, Page 7) are a new implementation of this old idea, but the concept itself is not a novel contribution of this paper.
- Functionally Similar to Existing Commercial Fabrics: At a conceptual level, FRED is a wafer-scale implementation of the same design philosophy embodied by NVIDIA's NVSwitch. NVSwitch uses dedicated switch chips to create a non-blocking fabric that provides all-to-all connectivity between GPUs in a server. FRED scales this idea to a wafer with hundreds of processing elements. While the physical implementation challenges are different, the core architectural pattern—using a switched fabric to overcome the limitations of direct point-to-point connections—is functionally identical to existing, commercially available technology.
- Performance Gains Are Not a Novel Insight: The headline performance improvements (up to 1.87x) are almost entirely derived from comparing a FRED-D configuration with 30 TBps of bisection bandwidth to a baseline Mesh with only 3.75 TBps (Table 5, Page 10). It is not a novel discovery that an interconnect with 8x more bandwidth performs better. The architecturally equivalent comparison, FRED-A, shows negligible improvement, indicating that the novelty of the FRED topology itself provides little performance benefit over a basic Mesh when bandwidth is held constant (Table 5, Page 10). The novelty claim cannot rest on an artifact of an unfair comparison.
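The bandwidth-versus-topology point can be checked against a first-order all-reduce model, sketched below. It uses the standard bandwidth-optimal bound, approximates per-participant bandwidth as bisection bandwidth divided by participant count, and ignores latency, congestion, and in-network reduction; the workload size and participant count are assumptions, so the output is a sanity check rather than a reproduction of Table 5.

```python
# First-order all-reduce time model: bandwidth-optimal collectives move roughly
# 2*(p-1)/p * message_size bytes per participant, so completion time scales
# inversely with per-participant bandwidth. This ignores latency terms,
# congestion, and in-network reduction; it is only a sanity check on where an
# ~8x bandwidth gap should show up.

def allreduce_time_s(message_bytes, participants, per_node_bw_bytes_per_s):
    moved = 2 * (participants - 1) / participants * message_bytes
    return moved / per_node_bw_bytes_per_s

GRAD_BYTES = 2 * 10e9      # e.g., 10B parameters in FP16 (assumed workload)
P = 64                     # data-parallel participants (assumed)

for label, bw in (("Mesh, 3.75 TBps bisection", 3.75e12 / P),
                  ("FRED-D, 30 TBps bisection", 30e12 / P)):
    t = allreduce_time_s(GRAD_BYTES, P, bw)
    print(f"{label}: {t * 1e3:.1f} ms per all-reduce")
```

Under this model the two rows differ by exactly the bandwidth ratio, which is why the equal-bandwidth FRED-A comparison is the one that isolates the topology's contribution.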
Questions to Address In Rebuttal
- The architectural pattern of FRED is conceptually similar to NVIDIA's NVSwitch fabric. What is the fundamental, novel insight of your work beyond scaling the NVSwitch concept to a wafer-level implementation? What new, non-obvious problems did you identify and solve that are unique to the on-wafer context?
- Fat-tree networks and in-network collectives are staples of the HPC field. What is the specific, novel "delta" in your implementation of these concepts (Section 4, Page 6; Section 6.1, Page 7) that you believe constitutes a significant advancement over prior art, beyond simply being implemented on a wafer?
- The paper claims to be a "new fabric" (Section 1, Page 2), but the performance gains seem to come from increased bandwidth rather than the novelty of the topology itself (as shown by the FRED-A results in Table 5, Page 10). Can you defend the novelty of the FRED architecture in light of the fact that, at equivalent bandwidth, it provides minimal benefit over a standard Mesh?
- Given that a fat-tree requires complex, long-range global wiring on a 2D substrate (Figure 8, Page 8), what is the novel contribution of FRED in solving the physical design and routability challenges, a problem that has historically led designers of on-chip networks to prefer simpler mesh-like topologies?
PD Constraint-aware Physical/Logical Topology Co-Design for Network on Wafer
Abstract
As cluster scales for LLM training expand, waferscale chips, characterized by the high integration density and bandwidth, emerge as a promising approach to enhancing training performance. The role of Network on Wafer (NoW) is becoming increasingly ...
Reviews
Review 1
Paper Title: PD Constraint-aware Physical/Logical Topology Co-Design for Network on Wafer
Reviewer Persona: The Guardian (Adversarial Skeptic)
Summary
The authors propose a co-design methodology for developing networks on wafers (NoW) for LLM training. The central claim is that existing approaches design physical and logical topologies in isolation, leading to suboptimal performance. To address this, they introduce a "mesh-switch" physical topology, which aims to balance the high integration density of mesh with the communication efficiency of switch-based networks like fat-trees. A Design Space Exploration (DSE) is conducted to find an optimal configuration under physical design (PD) constraints. Building on this physical topology, they propose a "dual-granularity" logical topology and fine-grained communication scheduling. The authors claim their final design achieves up to a 2.39x performance improvement over a baseline mesh network.
Strengths
- Problem Formulation: The paper correctly identifies a fundamental tension in waferscale design: the trade-off between allocating area for computation versus communication. The motivation to move beyond pure mesh or pure fat-tree (like FRED) topologies is well-founded and addresses a genuine challenge in the field.
- Constraint-Awareness: The explicit inclusion of PD constraints, such as maximum interconnect length (50mm) and wiring density (3 metal layers), is a crucial step toward practical waferscale architecture design. This grounds the exploration in realistic physical limitations.
- Holistic Scope: The work attempts to connect multiple layers of the system stack, from the physical layout (mesh-switch) to logical communication patterns (tree+ring) and finally to parallelism strategies (topology-aware sharding). This breadth is commendable.
Weaknesses
My primary concerns revolve around the methodological rigor of the DSE, the validity of the baselines used for comparison, and the overstatement of key claims.
-
Superficial DSE Methodology: The core of the physical topology design rests on a Design Space Exploration (DSE) process (Section 5.3, page 7). However, this DSE is driven by an analytical model (Eqs. 1-4) for area estimation, which the authors admit has an "average error below 15%". For a process intended to find a single "optimal" configuration, relying on a model with such a significant and uncharacterized error margin is concerning. The optimality of the 2x2 mesh group is not rigorously proven but is rather a product of this potentially inaccurate model. The claim that this is computationally necessary (0.67ms vs. 24 hours) does not excuse the potential for the model to lead the DSE to a local, or even incorrect, optimum.
-
Unsupported Generality of the "Optimal" Configuration: "Key Insight 3" (page 8) boldly claims that "the DSE for mesh-switch physical topology consistently converges to 2x2 mesh group configuration" across various hardware and applications. This is a significant overstatement. The supporting evidence in Figures 11 and 12, while showing a peak at 2x2 for the tested configurations, is insufficient to establish such a universal design rule. The result is likely highly sensitive to the specific parameters in Table 2 (e.g., the relative area and performance of compute vs. switch dies). A different set of assumptions could easily shift the optimal point. The claim of consistent convergence is not substantiated.
-
Flawed Performance Breakdown and Baseline Selection: The headline claim of a 2.39x improvement is derived from a breakdown analysis in Figure 24 (page 13). This analysis is methodologically unsound. The baseline for the entire stack is Mesh+Ring. However, the authors acknowledge in Section 8.1 that the TTO logical topology is superior for Mesh. By comparing their logical topology against the weaker "Ring" topology, they artificially inflate the contribution of their "Logical" step (a 1.42x gain). A rigorous comparison would evaluate the full proposed system (MS+Logical+Para) against the strongest possible baseline system (e.g., Mesh+TTO with optimized parallelism). The current breakdown appears engineered to maximize the reported gains.
-
Insufficient Justification for Topology Exclusion: The paper dismisses other SOTA network topologies like Dragonfly and Flattened Butterfly based on wiring density and signal integrity arguments (Section 5.1 and Figure 9). This justification is weak. Figure 9c presents signal loss in dB without defining an acceptable threshold or referencing a specific signaling technology's budget (e.g., a required signal-to-noise ratio). Without a quantitative and well-motivated threshold, the argument that these topologies are infeasible remains an assertion, not a proven fact. This weakens the paper's central claim that mesh-switch is the superior choice among viable alternatives.
-
The "Co-Design" Framework is a Re-labeling of a Sequential Process: The "TickTock Framework" is presented as a novel co-design methodology. In practice, as described, it appears to be a sequential process: first, the DSE is used to fix the physical topology, and then a compatible logical topology is designed for it. There is no evidence of a feedback loop where findings from the logical topology design phase inform or cause a revision of the physical topology. Without this iterative feedback, this is not "co-design" but rather a well-structured sequential design flow.
Questions to Address In Rebuttal
- On DSE Validity: Can you provide a sensitivity analysis showing how the optimal mesh group configuration (2x2) changes with variations in the key DSE parameters (e.g., compute die area, switch die area/bandwidth)? How do you ensure the 15% error in your analytical model does not lead you to a suboptimal configuration? (A sketch of the kind of sweep we have in mind appears after this list.)
- On Performance Claims: Please provide a revised performance breakdown (as in Figure 24) where the baseline is the strongest possible configuration for Mesh (i.e., using the TTO logical topology you cite as superior). How does this affect the reported 1.42x gain from the logical topology and the final 2.39x overall gain?
- On Topology Exclusion: Can you provide a quantitative signal integrity budget (e.g., maximum acceptable insertion loss in dB at the target frequency) for the assumed waferscale interconnect technology? Please show explicitly how topologies like Dragonfly violate this specific, technically-grounded budget.
- On the Co-Design Claim: Can you point to a specific instance in your methodology where the design of the logical topology or parallelism strategy forced a re-evaluation and modification of the physical topology? If not, please justify the use of the term "co-design" over "sequential, constraint-aware design."
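To clarify what the first question asks for, the sketch below shows the shape of the sensitivity sweep we expect in the rebuttal. The cost model, its functional form, and every constant are placeholders standing in for the paper's Eqs. 1-4 and Table 2, not a reimplementation of them; the point is to show whether the reported 2x2 optimum is stable as the inputs move.

```python
import itertools

def iteration_time(group_dim, die_area, port_area, wafer_area, work=1000.0, hop_cost=0.3):
    """Toy model of the mesh-group-size trade-off: every extra group costs
    central-switch port area (leaving less area for compute dies), while
    larger groups pay more intra-group mesh hops. All constants and the
    functional form are illustrative placeholders."""
    dies_per_group = group_dim ** 2
    n_groups = int(wafer_area / (dies_per_group * die_area + port_area))
    compute = work / (n_groups * dies_per_group)      # more dies -> faster compute
    local_comm = hop_cost * 2 * (group_dim - 1)       # mesh diameter within a group
    global_comm = 1.0                                 # one hop through the central switch
    return compute + local_comm + global_comm

# The sweep we expect to see reported: does the winning group size move?
for die_area, port_area in itertools.product((25, 30, 35), (40, 60, 80)):
    best = min((1, 2, 3, 4), key=lambda d: iteration_time(d, die_area, port_area, 40000))
    print(f"die_area={die_area}, port_area={port_area}: best group = {best}x{best}")
```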
Review 2
Paper Title: PD Constraint-aware Physical/Logical Topology Co-Design for Network on Wafer
Reviewer Persona: The Synthesizer (Contextual Analyst)
Summary
This paper presents a holistic co-design framework for creating high-performance Networks-on-Wafer (NoW) specifically tailored for Large Language Model (LLM) training. The authors' central thesis is that prior work has suffered from "orphan designs," optimizing either the physical topology (e.g., Mesh) or the logical topology (e.g., routing algorithms) in isolation, thereby hitting a premature performance ceiling.
To address this, the authors propose a comprehensive methodology that bridges this gap. Their core contributions are threefold:
1. A hybrid "mesh-switch" physical topology that balances the high compute density of mesh networks with the superior communication performance of switched fat-tree networks.
2. A Physical Design (PD) constraint-aware Design Space Exploration (DSE) algorithm that systematically searches for the optimal configuration of this topology, grounded in realistic wafer-level constraints like area, power, and maximum interconnect length.
3. A "dual-granularity" logical topology and associated collective algorithms designed specifically to leverage the hierarchical nature of their proposed physical network.
Through simulation on a range of LLM workloads, the authors demonstrate that their co-designed NoW achieves up to a 2.39x performance improvement over existing state-of-the-art mesh-based designs.
Strengths
This work's primary strength lies in its conceptual framing and methodological rigor. It correctly identifies a fundamental weakness in the design of large-scale, monolithic systems and proposes a compelling, systematic solution.
-
Excellent Problem Formulation: The paper's core insight—that physical and logical network topologies for waferscale systems must be co-designed to unlock their full potential—is both powerful and timely. The "ticktock" analogy presented in Figure 1 (Page 2) is an effective visualization of this concept, clearly articulating why previous "orphan designs" are insufficient. This positions the work not as a simple point solution, but as a new way of thinking about the problem.
-
A Principled Hybrid Solution: The proposed "mesh-switch" topology is a well-reasoned middle ground. It recognizes the fundamental trade-off between computation and communication resources on a constrained silicon area, a lesson well-established in the HPC community. By analyzing the shortcomings of pure Mesh (communication-bound) and pure fat-tree like FRED (compute-bound, as shown in Figure 5, Page 4), the authors motivate their hybrid approach not just by intuition, but with clear data.
-
Grounding in Physical Reality: The most significant contribution, from a systems perspective, is the "PD Constraint-aware" DSE. Many academic papers on network topologies remain in the abstract. This work grounds its exploration in concrete physical limitations (e.g., the 50mm D2D link distance constraint mentioned in Section 2.2, Page 3), lending the results significant credibility. This methodology provides a valuable blueprint for future research and industrial development in this space.
-
Connecting the Full Stack: The paper successfully connects the dots from the lowest level of physical constraints up to the highest level of workload performance. The framework spans physical topology, logical topology, collective algorithms, and even parallelism/sharding strategies (Section 7, Page 10). This full-stack view is rare and extremely valuable, showing a deep understanding of how decisions at one level of the system cascade and affect others. The analysis showing why a technique like Ulysses is not a simple drop-in replacement for mesh-based groups (Figure 19c, Page 11) is a prime example of this nuanced, topology-aware approach.
Weaknesses
While the core idea and methodology are strong, the work could be further contextualized and strengthened by addressing some practical systems-level challenges inherent to its chosen domain.
-
Absence of Fault Tolerance and Yield Considerations: Waferscale integration is fundamentally a battle against manufacturing defects. The proposed design, with its reliance on a centralized set of switches, appears to have a significant single point of failure. The paper does not discuss how the topology or routing would adapt to the inevitable presence of dead compute dies, broken links, or faulty switch components. For a waferscale architecture to be viable, a strategy for gracefully degrading performance or routing around faults is not just a feature, but a necessity. This is a well-trodden area in both HPC and NoC research that this work would benefit from engaging with.
-
Scalability of the Centralized Switch Model: The DSE convincingly finds an optimal configuration for the target wafer size. However, the reliance on a single, logically centralized switch could become a bottleneck as substrate sizes scale further (e.g., to the 300,000 mm² glass panels mentioned in Section 5.6, Page 8). A brief discussion on the architectural evolution—perhaps towards a hierarchical or multi-level switch fabric for even larger systems—would strengthen the paper's claims of scalability.
-
Generalizability Beyond LLM Workloads: The work is laser-focused on LLM training, which is appropriate given its importance. However, waferscale systems have potential applications across HPC, scientific computing, and graph analytics, which feature different communication patterns (e.g., heavy nearest-neighbor, sparse all-to-all). A short discussion on how the optimal "mesh-switch" configuration might change for these workloads would broaden the paper's impact and highlight the flexibility of the underlying DSE framework.
Questions to Address In Rebuttal
The authors are encouraged to use the rebuttal to address the following points, which would significantly strengthen the paper's contribution:
-
Could you elaborate on the fault tolerance of the proposed mesh-switch architecture? How does the system handle manufacturing defects within a mesh group or, more critically, within the central switch fabric? Are there provisions for redundant paths or disabling faulty components?
-
Your DSE identifies the 2x2 mesh group as optimal for the evaluated configurations. How sensitive is this result to the communication patterns of the workload? For instance, in an HPC application dominated by stencil computations (nearest-neighbor traffic), would the DSE favor larger mesh groups and a smaller central switch?
-
Regarding the centralized switch, what do you foresee as the primary scaling limiter? Is it the bisection bandwidth of the switch itself, the physical area required, or the complexity of wiring connections from all mesh groups to a central location?
Review 3
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper proposes a framework for co-designing the physical and logical topologies for a Network on Wafer (NoW), specifically for training Large Language Models (LLMs). The authors contend that prior works have treated physical and logical topology design as separate ("orphan") problems. Their core contribution is a methodology that performs a joint optimization under explicit Physical Design (PD) constraints, such as maximum interconnect length and wiring density. This methodology leads them to propose a novel hybrid physical topology called "mesh-switch," which combines dense mesh compute groups with a central, fully-connected switch. Based on this physical layout, they design a corresponding "dual-granularity" logical topology. The authors employ a Design Space Exploration (DSE) algorithm to find the optimal parameters for their mesh-switch architecture, ultimately demonstrating significant performance gains over existing mesh and fat-tree NoW designs.
Strengths
From a novelty perspective, the paper's primary strength lies in its synthesis of existing concepts into a new, constrained optimization problem specific to waferscale integration.
-
Formalization of a Constrained Co-Design Problem: The most significant novel contribution is the rigorous formulation of the NoW topology design problem under realistic, hard physical constraints (Section 2.2, page 3; Section 5.1, page 6). While hardware-software co-design is a known field, this paper's sharp focus on the unique PD constraints of waferscale systems (e.g., the 50mm link length limit, 3-metal-layer wiring density) provides a novel and practical lens through which to evaluate architectural trade-offs. This moves the discussion beyond abstract graph topologies to physically-realizable designs.
-
A Novel Hybrid Physical Topology: The proposed "mesh-switch" architecture is a genuinely new data point in the NoW design space. It is a well-motivated hybrid that explicitly attempts to combine the high compute density of mesh architectures (e.g., Cerebras, Dojo) with the superior communication performance of switched fat-tree networks (e.g., FRED [78]). This hybrid is not an arbitrary choice but a direct outcome of their analysis of the computation vs. communication trade-off, making it a well-justified architectural proposal.
-
Methodological Novelty in DSE: The development and application of a fast DSE algorithm (Section 5.3, page 7) to navigate the design space of this new hybrid topology is a valuable contribution. It provides a concrete, repeatable methodology for finding an optimal configuration (identified as 2x2 mesh groups), lending credibility and practicality to their proposed framework.
Weaknesses
The paper's claims of novelty are, in some areas, overstated, as several core concepts are reformulations of long-established principles in parallel computing and architecture.
-
Incremental Nature of "Co-Design": The central premise of physical/logical "co-design" is not new. This principle is foundational in HPC, where logical communication patterns (e.g., MPI collectives) are routinely optimized for specific physical interconnects (e.g., fat-trees, dragonflies). The paper cites work like Huang et al. [39] which explicitly covers "Communication algorithm-architecture co-design." The novelty here is therefore not the concept of co-design itself, but its specific application to the PD-constrained waferscale domain. The framing could be more precise about this delta.
-
The "Dual-Granularity Logical Topology" is a Consequence, Not an Invention: The proposed logical topology (Section 6, page 9) is a direct and rather obvious consequence of the hierarchical physical topology. A physical design with local mesh groups and a global switch naturally mandates a hierarchical routing scheme (e.g., XY routing within the group, direct hop via the switch between groups). Similarly, optimizing collectives via a hierarchical algorithm (e.g., tree+ring) is standard practice for hierarchical hardware. The conceptual leap in the logical topology design is therefore minimal; it is the necessary software mapping for the novel hardware, not a standalone innovation.
-
The "TickTock Framework" is Re-labeling: The "ticktock framework" presented in Figure 1 (page 2) is a compelling visual but does not represent a fundamentally new design methodology. It is an illustration of iterative design, a standard engineering practice. The true methodological contribution is the detailed analytical model and DSE process described in Section 5.3, not this high-level diagram.
Questions to Address In Rebuttal
-
The concept of algorithm-architecture co-design is well-established in the literature. Could the authors please clarify the fundamental novelty of their framework beyond the (admittedly important) application of this principle to the specific PD constraints of waferscale systems? What is the core, generalizable insight from your methodology that prior co-design work has missed?
-
The mesh-switch topology is an intuitive hybrid of mesh and a central switch. Hierarchical topologies that combine different network types at different levels are a classic architectural pattern. What is the non-obvious insight that makes this particular combination uniquely suited for NoW? Is there a theoretical justification for this being the optimal hybrid class, or is its superiority based solely on the empirical DSE results presented?
-
The paper presents the dual-granularity logical topology as a key contribution. However, given the hierarchical physical topology, a corresponding hierarchical logical topology seems to be the most straightforward and logical implementation choice. Could you elaborate on what makes this logical design non-trivial or inventive in its own right, distinct from the physical structure it is designed to serve?
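As context for the last question, the sketch below shows how directly a hierarchical physical topology dictates its logical collectives. It is a schematic of the obvious hierarchical mapping (local reduction, global exchange through the switch, broadcast back), not the paper's dual-granularity algorithm.

```python
def hierarchical_allreduce(groups):
    """Reduce within each mesh group, combine one partial per group across the
    central switch, then broadcast the result back; a toy all-reduce over the
    assumed two-level hierarchy."""
    partials = [sum(group) for group in groups]          # stage 1: intra-group reduce
    total = sum(partials)                                # stage 2: inter-group via switch
    return [[total] * len(group) for group in groups]    # stage 3: broadcast back

groups = [[1, 2, 3, 4], [5, 6, 7, 8]]   # two 2x2 mesh groups of per-die values
print(hierarchical_allreduce(groups))    # every die ends with 36
```

If the logical design amounts to this mapping plus standard ring/tree collectives at each level, the burden is on the authors to identify what in it is inventive rather than implied by the hardware.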
Cassandra: Efficient Enforcement of Sequential Execution for Cryptographic Programs
Abstract
Constant-time programming is a widely deployed approach to harden cryptographic programs against side channel attacks. However, modern processors often violate the underlying assumptions of standard constant-time policies by transiently executing ...
Reviews
Review 1
Paper Title: Cassandra: Efficient Enforcement of Sequential Execution for Cryptographic Programs
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose CASSANDRA, a hardware/software co-design aimed at enforcing a strictly sequential execution model for cryptographic programs, thereby eliminating the attack surface of control-flow speculation (Spectre-v1/v2). The core mechanism replaces the conventional Branch Prediction Unit (BPU) for cryptographic code with a "record-and-replay" system. This involves an upfront, offline analysis of a program's branch behavior to generate compressed control-flow traces. These traces are then loaded into a new microarchitectural component, the Branch Trace Unit (BTU), which dictates fetch redirection for crypto code at runtime. The central claims are that this approach not only provides a strong security guarantee against control-flow misspeculation but, counter-intuitively, also results in a performance improvement (a 1.85% speedup on average) over an unprotected baseline by eliminating branch mispredictions.
Strengths
- Sound Foundational Premise: The work correctly identifies a core characteristic of constant-time cryptographic code: its control flow is, by design, largely independent of secret inputs. Leveraging this property to pre-compute execution paths is a logical, if not entirely novel, direction for exploration.
- Novel Trace Compression: The application of k-mers counting, a technique inspired by genomics, to compress branch traces is a clever and effective method for reducing the storage overhead of the proposed approach. The compression rates reported in Table 1 (page 4) are substantial and represent a non-trivial technical contribution.
- Focused Problem Domain: The paper wisely constrains its solution to the domain of cryptographic code, avoiding the untenable challenge of applying this method to general-purpose programs. This focus makes the proposed trade-offs more plausible.
Weaknesses
My primary concerns with this work center on the fragility of its core assumptions, the practical deployability of the solution, and an insufficient analysis of its performance and security implications.
-
Overstated and Poorly Substantiated Performance Claims: The headline claim of a 1.85% speedup over an "unsafe baseline" is highly suspect and relies on an incomplete performance model.
- The entire performance benefit hinges on avoiding the cost of branch mispredictions. However, the cost of a miss in the proposed Branch Trace Unit (BTU) is not rigorously evaluated. The paper states, "In case of a trace miss in the BTU, the frontend stalls until the trace becomes available" (Section 5, page 6). This is a critical event. A stall to fetch trace data from memory is potentially orders of magnitude more costly than a typical branch misprediction penalty. The evaluation in Section 7 provides no data on BTU miss rates or the latency of servicing such a miss. Without this data, the claimed performance improvement is unsubstantiated. The geomean could easily become a significant slowdown if BTU misses are non-trivial.
- Figure 7 (page 10) shows that for several benchmarks (e.g., sphincs-shake-128s, DES_ct, ModPow_i31), the performance gain is negligible. The 1.85% geomean papers over a highly variable and, in many cases, insignificant result.
-
Severe Practicality and Deployment Hurdles: The proposed software/hardware contract creates a system that is fundamentally brittle.
- Trace Invalidation: The entire mechanism is predicated on static PC addresses. Any change to the binary—recompilation with a different compiler version, different optimization flags, a security patch, or even static library updates—will invalidate the pre-computed traces. The authors dismiss this by stating traces need to be regenerated (Q2, page 11), but this severely underestimates the logistical complexity in any real-world software ecosystem. It effectively prohibits standard binary distribution and patching, shifting an enormous burden onto developers or distributors.
- Handling of Public Parameters: The solution for code with control flow dependent on public parameters (e.g., key sizes, modes of operation) is to generate separate traces for each configuration (Q1, page 11). This leads to significant binary bloat and a combinatorial explosion of required traces for any moderately complex cryptographic library. The proposed fallback of stalling the pipeline for dynamic cases like stream ciphers is an admission that the core model is not general enough, even within the crypto domain.
-
Insufficient Security Analysis: The security model has potential gaps that are not adequately addressed.
- The BTU as a New Side Channel: The authors' rebuttal to the BTU being a timing channel is a simple analogy to the ICache (Q5, page 12). This is insufficient. The BTU is a new, stateful structure with its own specific access patterns, LRU replacement policy, and contention points (Pattern Table, Trace Cache, Checkpoint Table). An attacker could potentially engineer contention on these structures to create a new timing side channel, leaking information about which crypto branches are being executed by a victim. A rigorous analysis of this new attack surface is absent.
- Integrity of the Crypto PC Range: The system's security relies on correctly partitioning the address space into "crypto" and "non-crypto" regions. The mechanism for this is a "new status register" (Section 5.2, page 7). How is this register set and protected? More critically, the paper claims to prevent non-crypto branches from speculatively redirecting to crypto code via an "integrity check" (Section 5.3, page 7). This is a crucial security boundary, yet its implementation and robustness are not detailed. A failure here would allow an attacker to bypass CASSANDRA entirely by misspeculating from outside the protected region to a location inside it.
Questions to Address In Rebuttal
The authors must provide clear, data-driven answers to the following questions to justify their claims:
- BTU Performance: What were the miss rates for the Pattern Table and Trace Cache across all benchmarks in your evaluation? What is the full cycle cost assumed for servicing a BTU miss that requires a memory access? Please provide a sensitivity analysis showing how the overall performance changes as this miss latency increases (a toy version of the model we have in mind appears after this list).
- Software Lifecycle: Please provide a concrete workflow for how a large, distributed software project (e.g., OpenSSL) would integrate CASSANDRA's trace generation into its development, testing, and security patching lifecycle without creating an untenable maintenance burden.
- BTU Security: Beyond the ICache analogy, please provide a detailed analysis of the potential for timing-based side channels within the BTU's components. Can an attacker in a sibling hardware thread create contention in the PAT/TRC to infer the control flow of a victim process? Is the BTU state flushed on all context switches, and if so, what is the performance cost?
- Boundary Protection: Please detail the exact hardware mechanism of the "integrity check" that prevents non-crypto branches from speculatively targeting crypto-code regions. How is this check performed without re-introducing the very pipeline stalls that branch prediction aims to avoid? How is its correctness and security guaranteed?
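To make the first question concrete, the toy model below captures the trade we expect the sensitivity analysis to quantify: cycles gained by eliminating mispredictions versus cycles lost to frontend stalls on BTU trace misses. Every number is an illustrative placeholder, not a value from the paper.

```python
def net_cycles_saved(n_branches, mispredict_rate, mispredict_penalty,
                     btu_miss_rate, btu_miss_latency):
    """Cycles gained by replaying instead of predicting, minus cycles lost to
    stalls while a missing trace is fetched. All inputs are placeholders."""
    gained = n_branches * mispredict_rate * mispredict_penalty
    lost = n_branches * btu_miss_rate * btu_miss_latency
    return gained - lost

for miss_latency in (20, 100, 300):   # roughly cache-hit vs. LLC vs. DRAM-class stalls
    print(miss_latency, net_cycles_saved(1_000_000, 0.01, 15, 0.001, miss_latency))
```

Even a 0.1% trace-miss rate erases the benefit once misses are serviced at memory-class latencies, which is precisely why the missing miss-rate and miss-latency data matter.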
Review 2
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper addresses the critical problem of speculative execution attacks undermining the security guarantees of constant-time cryptographic programs. Existing defenses typically incur performance overhead by restricting or fencing speculation. The authors of CASSANDRA propose a radical and elegant alternative: for this specific class of programs, they advocate for entirely replacing control-flow speculation with a deterministic "record-and-replay" mechanism.
The core idea is enabled by two key insights into constant-time cryptography: (1) its sequential control flow is, by design, independent of secret inputs and thus largely static, and (2) its implementation is typically loop-intensive and repetitive, making its control-flow trace highly compressible. The authors introduce a novel offline analysis step, cleverly inspired by k-mers counting from bioinformatics, to massively compress these traces. In hardware, a new, small component called the Branch Trace Unit (BTU) consumes these compressed traces to provide perfect, deterministic fetch redirection for crypto code, while the conventional Branch Prediction Unit (BPU) handles non-crypto code.
The central and most compelling result is that this mechanism not only enforces a strict sequential execution model—the very model that constant-time programmers assume—but also counter-intuitively yields an average performance speedup of 1.85% over an unsafe baseline processor. This transforms a security mechanism into a performance optimization.
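The compression idea is easiest to see on a toy example. The snippet below is not the paper's algorithm; it is a minimal sketch of the general principle, assuming a two-symbol taken/not-taken trace: count k-length windows to find the dominant loop pattern, then run-length encode the trace against it.

```python
from collections import Counter

def kmer_table(trace, k):
    """Count every length-k window in the branch-outcome trace."""
    return Counter(trace[i:i + k] for i in range(len(trace) - k + 1))

def compress(trace, k):
    """Greedy run-length encoding against the most frequent k-mer."""
    pattern = kmer_table(trace, k).most_common(1)[0][0]
    out, i = [], 0
    while i < len(trace):
        run = 0
        while trace.startswith(pattern, i):
            run += 1
            i += len(pattern)
        if run:
            out.append((pattern, run))   # a repeated loop body
        else:
            out.append((trace[i], 1))    # a literal outcome
            i += 1
    return out

# Toy trace: a 6-iteration loop body (taken x5, fall-through once) repeated 100 times.
trace = ("T" * 5 + "N") * 100
print(len(trace), "outcomes ->", compress(trace, k=6))   # 600 outcomes -> [('TTTTTN', 100)]
```

The loop-intensive, input-independent structure of constant-time kernels is exactly what makes this kind of pattern-based encoding pay off.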
Strengths
-
A Novel and Elegant Conceptual Shift: The most significant strength of this work is its departure from the prevailing "manage the damage" paradigm of Spectre defenses (e.g., taint tracking, fencing). Instead of trying to secure a speculative process, CASSANDRA replaces it with a deterministic one for a domain where this is feasible. This "record-and-replay" concept is a fundamentally different, and arguably much cleaner, approach to solving the control-flow speculation problem for deterministic code.
-
Exceptional Performance Results: Achieving a security guarantee that is stronger than the state-of-the-art while simultaneously improving performance over an unprotected baseline is a remarkable result. The elimination of branch mispredictions for crypto kernels is a powerful performance lever, and the authors have successfully exploited it. This flips the traditional security-performance trade-off on its head and makes adoption highly compelling.
-
Strong and Simple Security Guarantee: CASSANDRA enforces the exact sequential execution model that programmers reason about when writing constant-time code. This closes the dangerous gap between the software's security model and the hardware's execution reality. It provides a clear, understandable guarantee: for CASSANDRA-protected code, control flow will not transiently deviate from the sequential path.
-
Clever Cross-Pollination of Ideas: The use of trace compression techniques inspired by DNA sequencing (Section 4.2, page 4) is an excellent example of creative, cross-disciplinary thinking. The problem of finding repeating patterns in a massive, linear trace of branch outcomes is structurally analogous to finding patterns in a DNA sequence, and the authors' application of this concept is what makes the entire approach practical by solving the trace storage problem.
Weaknesses
While the core idea is powerful, its practical scope and deployment model present some challenges that could be explored further. These are not fundamental flaws but rather boundaries of the proposed solution.
-
Domain Specificity: The solution is explicitly and effectively tailored to constant-time cryptographic code. While this is a critically important domain, the technique is not general-purpose. The paper is clear about this, but the impact is necessarily confined to workloads that feature this type of deterministic control flow.
-
Toolchain and Maintenance Overhead: The security of CASSANDRA relies on an offline, pre-computed trace. This introduces a tight coupling between the compiled binary and the hardware mechanism. Any recompilation, patch, or change in compiler options that alters instruction addresses (PCs) would require regenerating the traces. While the authors' automated procedure (Section 4.3, page 5) is a good first step, the integration of this process into the complex build and maintenance pipelines of major software projects (like OS kernels or large crypto libraries) presents a practical friction point that is not fully explored.
-
Handling of Public, Non-Secret Parameters: The paper discusses handling control flow that depends on public parameters (e.g., key sizes) by generating separate traces for each mode (Discussion Q1, page 11). For cryptographic primitives with many modes or configuration options, this could lead to a non-trivial increase in binary size to store all possible trace variants. This static, enumerative approach may become unwieldy.
Questions to Address In Rebuttal
-
The strength of CASSANDRA is its specialization. Looking beyond cryptography, could the authors elaborate on other potential domains where this deterministic "record-and-replay" model might be applicable? For example, could it be used for scientific computing kernels, codecs, or other programs with data-independent control flow, thereby broadening the potential impact of this architectural concept?
-
The trace generation process is an offline step. How do the authors envision this integrating into standard development and deployment pipelines, particularly for large, continuously evolving libraries like OpenSSL? What is the expected friction for developers and maintainers when patching code (e.g., for a non-crypto-related bug) that requires re-linking and thus invalidates all PC-based traces?
-
Regarding public parameters, the proposed solution of storing separate traces for different modes (e.g., AES-128 vs. AES-256) seems practical for a few variants. However, for algorithms with a wider range of public inputs that affect control flow (e.g., certain post-quantum schemes or protocol state machines), this could lead to significant binary bloat. Have the authors considered a more dynamic or hybrid approach, or is this static pre-generation seen as the only feasible path for maintaining performance?
Review 3
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper proposes CASSANDRA, a hardware/software co-design to enforce a strictly sequential execution model for constant-time cryptographic programs, thereby eliminating the attack surface for control-flow speculation attacks like Spectre. The core idea is to completely disable the branch predictor for this class of code and instead rely on a "record-and-replay" mechanism. This involves an offline analysis phase where the sequential control flow trace of a program is captured, compressed using techniques inspired by bioinformatics (k-mers counting), and embedded in the binary. At runtime, a new microarchitectural component, the Branch Trace Unit (BTU), consumes these compressed traces to provide perfect, deterministic fetch redirection, obviating the need for prediction. The authors claim this not only provides a strong security guarantee but also results in a performance speedup over an unsafe baseline due to the elimination of branch mispredictions.
Strengths
The primary strength of this paper lies in its novel re-contextualization of existing concepts for a new purpose, and the clever cross-domain application of a compression technique.
-
Novel Repurposing of a Known Concept: The idea of recording and replaying a stream of execution is not, in itself, new. It is the fundamental principle behind Trace Caches (Rotenberg et al., MICRO 1996 [58], 1997 [59]), which record dynamic instruction traces to improve fetch bandwidth. However, CASSANDRA repurposes this concept for an entirely different goal: security enforcement rather than performance optimization. The key delta is that Trace Caches are dynamic, opportunistic, and fall back to a speculative frontend, whereas CASSANDRA is static, mandatory, and its core design principle is the elimination of speculation. This shift in purpose leads to a fundamentally different design (offline analysis, stall-on-miss) which, to my knowledge, is a novel approach for Spectre mitigation.
-
Novel Application of Compression Technique: The use of k-mers counting, a technique from DNA sequencing, to compress branch traces is a genuinely novel contribution within the computer architecture domain. Prior work on trace compression has typically relied on more standard run-length encoding or dictionary-based methods. Applying an algorithm designed to find unknown repeating patterns in massive biological sequences to the problem of highly regular, loop-intensive cryptographic control flow (as detailed in Section 4.2, page 4) is a clever and effective insight that enables the entire proposal. This is the paper's most significant and indisputably new technical idea.
-
Surprising Result from a Novel Approach: The combination of a strict, security-enforcing execution model with a net performance gain is a powerful and non-obvious result. Typically, security mechanisms that restrict speculation incur significant performance overhead. The insight that for this specific domain, perfect prediction (via replay) is faster than state-of-the-art speculation is a novel finding that validates the complexity of the proposed mechanism.
Weaknesses
While the core ideas are strong, the novelty could be framed more sharply by explicitly differentiating from conceptually adjacent prior art.
-
Insufficient Differentiation from Profile-Guided Prediction: The paper cites profile-guided branch prediction work like Whisper [33], but it could do more to contrast its mechanism. Both use offline analysis to inform runtime branch resolution. The critical distinction is that Whisper assists the predictor, while CASSANDRA replaces it. CASSANDRA provides a guarantee of sequential control flow, whereas profile-guided schemes still operate within a speculative framework and offer no such guarantee. This distinction is the crux of the novelty and should be more central to the paper's narrative.
-
The "Record-and-Replay" Moniker: While accurate, the term "record-and-replay" is heavily loaded with prior art from debugging and performance (e.g., deterministic replay systems). The authors' novelty is not in the general concept, but in its static, compressed, and security-enforced instantiation for the CPU frontend. The paper would benefit from language that more precisely captures this specific instantiation to avoid being conflated with functionally different systems that share the same high-level name.
-
Novelty is Domain-Specific: The novelty and effectiveness of the entire approach are predicated on the unique properties of constant-time cryptographic code (static control flow). This is not a weakness of the work itself, but a boundary on its novelty. The proposed mechanism is not a general-purpose architectural innovation but a highly specialized one. This should be acknowledged more directly as a trade-off that enables the design.
Questions to Address In Rebuttal
-
Regarding Trace Caches: Please explicitly articulate the three most critical conceptual and implementation differences between the proposed Branch Trace Unit (BTU) and a classic Trace Cache [58, 59]. The key is to move beyond the difference in goals (security vs. performance) and focus on how those goals lead to a fundamentally different mechanism (e.g., static vs. dynamic trace generation, content of the trace, behavior on a miss).
-
Regarding Trace Compression: The application of k-mers counting from bioinformatics is a central pillar of your work. To the best of your knowledge, is this the first proposal to use this specific class of pattern detection algorithms for the compression of any hardware-level execution trace (e.g., branch, address, value traces) in the computer architecture literature?
-
Regarding Input-Dependent Control Flow: Footnote 3 on page 5 states that for branches whose traces depend on public inputs (e.g., plaintext length in a stream cipher), the processor stalls until the branch resolves. Could you quantify the performance impact of this specific case? More importantly from a novelty standpoint, does this necessary escape hatch represent a fundamental limitation of the "pure" replay model you propose?
-
Regarding Brittleness of the Approach: The proposed solution is novel because it is tailored to the properties of constant-time code. How does the mechanism behave if a programmer introduces a secret-dependent branch by mistake? Does the offline analysis tool detect this? If not, does the system fail in a secure way (e.g., by default-stalling) or does it fall back to an insecure behavior? The novelty of the solution is tied to its security guarantee, and it is important to understand the failure modes.
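As a concrete framing for the last question, one plausible offline check is trace differencing: run the kernel on several secret inputs and flag any branch whose outcome sequence diverges. The `trace_fn` hook and the toy program below are hypothetical stand-ins, not part of the paper's toolchain.

```python
def secret_dependent_branches(trace_fn, secrets):
    """Flag program counters whose branch outcomes differ across secret inputs.
    trace_fn(secret) is a hypothetical tracer returning (pc, taken) pairs."""
    traces = [trace_fn(s) for s in secrets]
    baseline = traces[0]
    flagged = set()
    for trace in traces[1:]:
        for (pc_a, taken_a), (pc_b, taken_b) in zip(baseline, trace):
            if pc_a != pc_b or taken_a != taken_b:
                flagged.add(pc_a)
    return flagged

# Toy "kernel": a branch at pc=0x40 whose outcome leaks the secret's low bit.
leaky_trace = lambda secret: [(0x10, True), (0x40, bool(secret & 1))]
print(secret_dependent_branches(leaky_trace, secrets=[0, 1, 2, 3]))   # {64}
```

Whether the authors' tool performs a check of this kind, and what happens at runtime if it is skipped, determines whether the system fails safe.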
FAST: An FHE Accelerator for Scalable-parallelism with Tunable-bit
Abstract
Fully Homomorphic Encryption (FHE) enables direct computation on encrypted data, providing substantial security advantages in cloud-based modern society. However, FHE suffers from significant computational overhead compared to plaintext computation, ...
Reviews
Review 1
Reviewer: The Guardian
Summary
The paper proposes FAST, a novel hardware accelerator for Fully Homomorphic Encryption (FHE) based on the CKKS scheme. The central thesis is that existing accelerators are too rigid, relying on a single key-switching method (e.g., Hybrid) and failing to leverage recent cryptographic optimizations such as the gadget decomposition key-switching method (KLSS) and hoisting. To address this, the authors introduce a versatile framework, Aether-Hemera, to dynamically select the optimal key-switching method based on the ciphertext level. The core hardware innovation is a Tunable-Bit Multiplier (TBM) designed to efficiently execute both 36-bit operations (optimal for Hybrid) and 60-bit operations (optimal for KLSS). The authors claim their design achieves a significant 1.8x average speedup over state-of-the-art accelerators.
Strengths
- Sound Motivation: The paper correctly identifies a critical gap in the FHE accelerator landscape. The observation that the computational trade-offs between different key-switching methods (Hybrid vs. KLSS) vary with ciphertext level (l) is insightful and provides a strong motivation for a more flexible hardware architecture.
- Novel Architectural Concept: The idea of a tunable-precision datapath, embodied by the Tunable-Bit Multiplier (TBM), is a clever architectural response to the divergent computational requirements of modern FHE algorithms. Supporting both 36-bit and 60-bit operations within a unified multiplier is a non-trivial design contribution.
- Analysis of Algorithmic Trade-offs: The analysis presented in Section 3, particularly in Figures 2 and 3, provides a valuable characterization of the performance landscape for different key-switching methods and the impact of hoisting. This analysis effectively grounds the hardware design choices in concrete algorithmic behavior.
Weaknesses
Despite its interesting premise, the paper suffers from several critical weaknesses related to its evaluation methodology, internal consistency, and substantiation of claims. These issues undermine the credibility of the reported performance improvements.
-
Questionable and Potentially Unfair Baselines: The primary performance comparison is against SHARP [20] and several enhanced variants (SHARPLM, SHARP8C, SHARPLM+8C) that the authors appear to have modeled rather than implemented. Footnote 2 on page 11 states, "We also model performance under large memory conditions by comparing the effects of reduced computational workload." This is not a substitute for a rigorous, cycle-accurate simulation of a modified baseline. It is highly likely that this "modeling" approach optimistically estimates the baseline's performance, thereby artificially inflating FAST's relative speedup. Similarly, the claim that SHARPLM "integrates direct hoisting technology" is not substantiated with implementation details. Without a fair and rigorously simulated baseline, the central performance claims of the paper are suspect.
-
Conflation of Algorithmic and Architectural Gains: The paper's core contribution is presented as a new architecture, yet a substantial portion of the performance gain stems from employing a superior algorithm (KLSS) that the baseline (SHARP) was never designed to support. The evaluation fails to disentangle these two effects. The "Efficiency Study" in Section 7.6 compares FAST (with TBM) to "FAST with a 36-bit ALU," which is still the authors' own architecture. A proper ablation study would compare FAST to a baseline like SHARP when both are limited to running the same algorithm (i.e., the Hybrid method). As it stands, it is impossible to determine how much of the speedup is due to the novel TBM and flexible architecture versus simply using a better algorithm.
-
Contradiction in On-Chip Memory Requirements: The authors claim in Section 5.6 (page 10) that the designed on-chip memory of 245MB is "enough to support the KLSS method." However, their own analysis in Figure 3b (page 5) clearly shows that the working set size for KLSS can reach nearly 295MB at the highest ciphertext levels. This is a direct contradiction. If FAST cannot support KLSS at the very levels where Figure 2a suggests it is most beneficial, the entire premise of the Aether-Hemera dynamic selection framework is compromised. The paper does not address how this memory shortfall is handled.
-
Vague Description of the Aether-Hemera Framework: The Aether analysis tool is described as "preprocessing on the server side" (Section 4.1.1, page 6) that takes an "FHE operation flow" to generate a configuration file. This description is superficial. It is unclear if this requires a full pre-simulation of the application or if it is a simple static analysis. Its ability to handle dynamic or data-dependent control flow is not discussed, limiting its generality. The complexity of managing two distinct sets of evaluation keys (for 36-bit and 60-bit moduli) and the potential hardware reconfiguration overheads are mentioned as challenges but are not adequately addressed in the description of Hemera's online management.
-
Unsupported Hardware Claims: The paper claims the TBM achieves its flexibility with "only 28% area overhead relative to conventional 60-bit multipliers" (Section 4.2, page 8). The definition of a "conventional" 60-bit multiplier is not provided; a fair comparison would be against a well-optimized, potentially composed, 60-bit modular multiplier, not necessarily a monolithic one. Furthermore, the assertion that "existing accelerators will struggle to integrate these capabilities" is a strong claim made without any supporting evidence or detailed reasoning.
Questions to Address In Rebuttal
-
Please provide a detailed methodology for how the enhanced SHARP baselines (SHARPLM, SHARP8C) were created. Was a cycle-accurate model of the SHARP architecture modified to include larger memory and support for hoisting, or was performance estimated by simply adjusting operation counts? The validity of the 1.8x speedup claim hinges on the fairness of this comparison.
-
Please address the apparent contradiction between the stated on-chip memory size of 245MB and the peak working set requirement of ~295MB for KLSS shown in your own Figure 3b. Does this memory limitation prevent the use of KLSS at high ciphertext levels? If so, how does this impact the effectiveness of the Aether-Hemera framework and the overall performance results?
-
To isolate the architectural contribution of FAST, can you provide performance data for your architecture against a baseline like SHARP when both systems are restricted to executing only the Hybrid key-switching method? This would provide a direct, apples-to-apples comparison of the hardware's efficiency. (One way to report this decomposition is sketched after this list.)
-
Could you elaborate on the offline analysis process of the Aether tool? What are its computational complexity and limitations? Specifically, does it require a full trace or simulation of the target application, and how would it handle applications with non-static computation graphs?
-
Regarding the TBM's 28% area overhead claim, what specific design was used as the "conventional 60-bit multiplier" baseline? Please provide details on its implementation to justify the comparison.
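One way to report the isolation asked for in the third question is the multiplicative decomposition below, where the first factor captures the hardware contribution at a fixed algorithm and the second the gain from Aether-Hemera's algorithm selection. The notation is ours, not the paper's.

$$
\text{speedup}_{\text{total}}
= \underbrace{\frac{T_{\text{SHARP, Hybrid}}}{T_{\text{FAST, Hybrid}}}}_{\text{architecture}}
\times
\underbrace{\frac{T_{\text{FAST, Hybrid}}}{T_{\text{FAST, Aether-Hemera}}}}_{\text{algorithm selection}}
$$

Reporting both factors separately would let readers judge how much of the 1.8x average is attributable to the TBM and datapath versus the adoption of KLSS and hoisting.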
Review 2
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents FAST, a novel hardware accelerator for Fully Homomorphic Encryption (FHE) based on the CKKS scheme. The central contribution lies in its co-designed software/hardware architecture that dynamically adapts to the changing computational requirements of an FHE application. The authors' core insight is that different key-switching algorithms—specifically the traditional hybrid method and the more recent KLSS method—exhibit superior performance at different ciphertext levels. Furthermore, these methods have disparate precision requirements (36-bit for hybrid vs. 60-bit for KLSS).
To exploit this, FAST introduces two key innovations:
1. Aether-Hemera Framework: A software layer where "Aether" performs offline analysis to determine the optimal key-switching strategy (including the use of hoisting) at each stage of the computation, and "Hemera" manages the corresponding evaluation keys at runtime.
2. Tunable-Bit Multiplier (TBM): A versatile hardware multiplier unit that can be dynamically configured to perform either two 36-bit multiplications in parallel or a single 60-bit multiplication, thereby efficiently supporting both key-switching methods without the significant area overhead of dedicated datapaths.
By integrating these concepts, FAST is the first accelerator architecture, to my knowledge, to systematically support and switch between multiple state-of-the-art key-switching primitives within a single FHE program execution. The evaluation demonstrates a significant average speedup of 1.8x over existing state-of-the-art FHE accelerators.
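A minimal sketch of the selection policy the summary describes, assuming (per the paper's Figure 2) that KLSS wins at higher ciphertext levels and that its evaluation-key working set must fit on chip. The crossover level and memory figures are placeholders, not values taken from or validated against the paper.

```python
def choose_keyswitch(level, crossover_level, klss_key_working_set_mb, on_chip_mb):
    """Illustrative policy only: prefer KLSS at high ciphertext levels when its
    key working set fits on chip, otherwise fall back to the hybrid method."""
    if level >= crossover_level and klss_key_working_set_mb <= on_chip_mb:
        return "KLSS", 60    # one 60-bit multiplication per TBM
    return "Hybrid", 36      # two parallel 36-bit multiplications per TBM

print(choose_keyswitch(level=30, crossover_level=20,
                       klss_key_working_set_mb=200, on_chip_mb=245))
```

The interesting engineering is in where Aether places the crossover and how Hemera stages the two key sets without stalling the pipeline.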
Strengths
-
Excellent Core Insight and Motivation: The paper's primary strength is its foundational observation, clearly articulated in the Motivation section (Section 3, pages 4-6). The analysis presented in Figure 2, which shows the performance crossover between the hybrid and KLSS key-switching methods as a function of ciphertext level, is compelling. This insight elevates the work from being "just another accelerator" to one that addresses a fundamental, dynamic property of FHE computations. It correctly identifies that a one-size-fits-all hardware design is suboptimal.
-
Bridging Cryptographic Theory and Hardware Architecture: This work does an admirable job of connecting recent advances in cryptographic methods (KLSS, hoisting) with concrete hardware design. Many accelerator papers optimize for established, sometimes outdated, cryptographic primitives. By explicitly designing for the latest techniques described in papers like Kim et al. [22] and Chen et al. [10], the authors ensure their architecture is relevant and pushes the state-of-the-art forward. This is a crucial contribution to the community, demonstrating how architects must co-evolve their designs with the underlying algorithms.
-
Elegant Hardware/Software Co-Design: The proposed solution is not purely a hardware effort. The Aether-Hemera framework is a clever software abstraction that manages the complexity of the dynamic decision-making process. It allows the hardware (specifically the TBM) to be flexible but relatively simple, while the intricate logic of when to switch precision and algorithms is handled offline. This separation of concerns is a hallmark of good system design.
-
Novel and Practical Hardware Primitive (TBM): The Tunable-Bit Multiplier is an elegant solution to the multi-precision problem identified in Section 3.2. Instead of brute-forcing the issue with separate 36-bit and 60-bit datapaths, the TBM offers a reconfigurable unit that maximizes parallelism for the common 36-bit case while still efficiently supporting the demanding 60-bit case. This demonstrates a deep understanding of hardware design trade-offs.
Weaknesses
While the core idea is strong, the paper could be improved by further exploring the implications and limitations of its approach.
-
Generality of the Aether Analysis: The Aether tool is presented as a key enabler, but its inner workings and generality are not fully detailed. It seems to function as a profile-guided optimization tool. How dependent is this analysis on specific FHE parameter sets (e.g., ring degree N, initial modulus Q)? If an application or its security parameters change, does the entire analysis need to be re-run? A discussion on the robustness and potential overhead of this offline stage would strengthen the paper.
-
Sensitivity to On-Chip Memory: The design relies on a very large (245MB) on-chip memory to support the significant evaluation key storage requirements of KLSS and hoisting (as shown in Figure 3b). This is a major factor in the total chip area (Table 3). While the authors perform a sensitivity analysis in Figure 13, the discussion could be expanded. The current approach seems tailored for a high-end design point. It would be valuable to understand how the Aether framework's decisions and the overall performance would change under more constrained memory budgets (e.g., 64MB or 128MB), which might be more commercially viable.
-
Complexity of Key Management: The Hemera runtime component must manage multiple sets of evaluation keys for different algorithms and potentially different hoisting factors. While the paper states this is handled, the true complexity and potential for pipeline stalls or memory bank conflicts during key loading and switching are not deeply analyzed. A more detailed examination of the runtime overhead of this dynamic management would be beneficial.
Questions to Address In Rebuttal
-
Could the authors elaborate on the automation and generality of the Aether analysis tool? For a new FHE application not benchmarked in this paper, what is the process and computational cost required for Aether to generate the optimal execution plan? Is any manual intervention or application-specific tuning required?
-
The 245MB on-chip memory is substantial. Could you comment on the performance trade-offs if the memory is constrained to a more modest size, say 128MB? How would the Aether tool adapt its strategy, and what would be the anticipated performance degradation on a benchmark like bootstrapping? Would it simply fall back to using the hybrid method more often?
-
The core idea of dynamic algorithm selection is powerful. Have you considered extending this concept beyond key-switching? For example, could different NTT algorithms or base conversion strategies be dynamically selected depending on the computational context? This work seems to open a new avenue for FHE accelerator design, and I would be interested to hear your perspective on its broader applicability.
Review 3
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present FAST, a hardware accelerator for the CKKS Fully Homomorphic Encryption (FHE) scheme. The paper's central claim to novelty rests on being the first hardware architecture to incorporate and dynamically manage multiple recent cryptographic optimizations that have, until now, primarily been explored in software. Specifically, the work claims novelty in three areas:
- The co-integration and hardware support for two distinct key-switching methods: the traditional hybrid method and the more recent gadget decomposition (KLSS) method.
- The creation of a software/hardware framework (Aether-Hemera) to analyze workload characteristics (e.g., ciphertext level) and dynamically select the optimal key-switching method during a single application's execution.
- The design of a "Tunable-Bit Multiplier" (TBM) architecture capable of dynamically switching between performing dual 36-bit multiplications or a single 60-bit multiplication, thereby efficiently supporting the differing precision requirements of the two key-switching methods.
The core idea is not a new fundamental cryptographic primitive, but rather a novel system-level co-design that synthesizes existing, but disparate, software-level optimizations into a cohesive and reconfigurable hardware architecture.
Strengths
From the perspective of novelty, the paper's primary strengths are:
-
Novel Synthesis of Existing Art: The core contribution is the architectural synthesis of very recent cryptographic optimizations. The KLSS method (Kim et al., 2023) and hoisting techniques have been proposed to improve FHE performance on CPUs/GPUs, but prior accelerator designs (e.g., SHARP, ARK) have not incorporated them. FAST appears to be the first to build a custom microarchitecture specifically to exploit them. This synthesis is non-trivial and represents a tangible advancement.
-
Novel Problem Identification and Motivation: The analysis in Section 3.1 (page 5, Figure 2) is a key contribution in its own right. The observation that the computational advantage shifts between the hybrid and KLSS methods depending on the ciphertext level (l) is a crucial insight. It provides a strong, novel motivation for a dynamic, multi-method architecture, moving beyond the static "one-size-fits-all" approach of prior work (a toy illustration of such a level-dependent choice follows this list).
-
Purpose-Built Reconfigurable Hardware: The Tunable-Bit Multiplier (TBM) presented in Section 4.2 (page 8) is a clever hardware solution to the problem identified. While reconfigurable or multi-precision arithmetic units are not fundamentally new concepts in digital design, the TBM is a purpose-built unit whose design is directly motivated by the specific precision dichotomy (36-bit vs. 60-bit) of the two targeted key-switching methods. Its novelty lies in its specific application and tight integration to solve a unique FHE acceleration challenge.
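The level-dependent trade-off noted in the second strength can be illustrated with a toy decision sketch. The cost functions and coefficients below are placeholders I have made up, not the paper's Aether cost analysis; the only point is that a cost crossover as the ciphertext level changes is what makes a dynamic, per-operation selection worthwhile.

```python
# Hypothetical cost models; the coefficients are placeholders, not the
# paper's measurements. They merely exhibit a crossover point.
def hybrid_cost(level, dnum=3):
    return 1.0 * level * dnum                # grows quickly with remaining limbs

def klss_cost(level, gadget_len=4):
    return 20.0 + 0.4 * level * gadget_len   # higher fixed cost, gentler slope

def choose_keyswitch(level):
    """Pick the cheaper key-switching method for the current ciphertext level."""
    return "hybrid" if hybrid_cost(level) <= klss_cost(level) else "KLSS"

for level in (4, 8, 16, 24, 32):
    print(level, choose_keyswitch(level))    # the chosen method flips past the crossover
```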
Weaknesses
My critique is focused on the precise boundaries of the novelty presented:
-
Contribution is Primarily in Synthesis, Not Invention: It must be stated clearly that the constituent parts of the proposed system are not new. The KLSS algorithm, hoisting, and Karatsuba-style multiplication (which the TBM is a variant of) are all pre-existing concepts. The paper's novelty rests entirely on being the first to combine them in a hardware accelerator. While this is a valid contribution, the paper could be more precise in delineating its system-level synthesis contribution from the underlying, previously-published algorithmic work.
-
Limited Scope of the Dynamic Framework: The Aether-Hemera framework is presented as a solution for choosing between the hybrid and KLSS methods. However, the work does not explore whether this framework represents a more general, extensible principle for FHE accelerator design. Is this a point solution for these two specific methods, or a paradigm that could incorporate future, as-yet-unknown FHE primitives and optimizations? The novelty would be significantly stronger if the latter were demonstrated or at least convincingly argued.
-
Complexity vs. Benefit Justification: The proposed architecture introduces significant complexity: a dual-method management framework, more intricate key management for two different key types, and a more complex multiplier unit with added control logic. While the authors show a notable performance gain (avg. 1.8x over SHARP), the performance-per-area gain is more modest (1.13x). This suggests that the novel techniques add nearly as much hardware cost as they provide in performance benefit. A deeper analysis on whether this trade-off is fundamentally advantageous across a wider range of parameters would strengthen the claim. For instance, how does this trade-off change if on-chip memory is more constrained?
Questions to Address In Rebuttal
-
The core novelty appears to be the system-level co-design of previously separate software optimizations. Beyond being "the first," what is the fundamental architectural insight or principle that future designers can learn from FAST? Is it simply "support more algorithms," or is there a deeper principle about reconfigurability in FHE that you are proposing?
-
Regarding the Tunable-Bit Multiplier (TBM): The use of three smaller multipliers to construct one larger multiplier is a well-known technique (e.g., Karatsuba). Please clarify how the TBM's design is novel beyond a direct hardware mapping of this principle to solve the specific 36/60-bit requirement. Was a broader design space of multi-precision units explored?
-
The Aether-Hemera framework makes decisions based on ciphertext level and hoisting opportunities. Could the authors elaborate on its extensibility? If a new key-switching algorithm with different precision requirements and performance characteristics were proposed next year, how much of the Aether-Hemera decision logic and the underlying FAST hardware would need to be redesigned? This speaks directly to the durability of the novel contribution.
Neo: Towards Efficient Fully Homomorphic Encryption Acceleration using Tensor Core
Abstract
Fully Homomorphic Encryption (FHE) is an emerging cryptographic technique for privacy-preserving computation, which enables computations on the encrypted data. Nonetheless, the massive computational demands of FHE prevent its further application to real-...
Reviews
Review 1
Reviewer: The Guardian
Summary
This paper proposes Neo, a framework for accelerating Fully Homomorphic Encryption (FHE) computations by mapping them onto the Tensor Core units found in modern NVIDIA GPUs. The core idea is to decompose the large integer polynomial multiplications at the heart of FHE into a series of smaller, fixed-precision matrix multiplications that are a perfect fit for the Tensor Core hardware. The authors present a "limb-interleaving" data layout strategy to manage the high-precision arithmetic and a "fragment-based" scheduling approach to handle polynomials that are larger than the Tensor Core's native dimensions. They claim this method achieves significant speedups (up to 7.8x) over existing CPU and GPU-based FHE libraries.
Strengths
The paper is founded on a clever and pragmatic observation that has the potential for significant real-world impact.
- Pragmatic Use of Existing Hardware: The core strength of this paper is its recognition that modern GPUs contain highly specialized, powerful compute units (Tensor Cores) that are currently underutilized by FHE workloads. The idea of re-purposing this existing, ubiquitous hardware for FHE acceleration, rather than designing a new ASIC, is a sound and practical approach.
- Clear Algorithmic Breakdown: The paper does an excellent job of systematically breaking down the complex problem of large polynomial multiplication into a sequence of smaller matrix multiplications that can be directly mapped to the Tensor Core's capabilities (Section 3, Pages 3-4). The mathematical formulation is clear and provides a solid foundation for the proposed mapping strategy.
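The algorithmic breakdown is easiest to appreciate in the smallest case: an NTT over Z_q is literally a matrix-vector product, which is what makes a mapping onto matrix engines conceivable at all. The sketch below uses toy parameters I chose for illustration (q = 257, n = 8) and does not reproduce Neo's actual fragment decomposition.

```python
# Toy NTT-as-matrix-multiply sketch (assumed parameters, not Neo's mapping).
import numpy as np

q, n = 257, 8                       # q prime with q ≡ 1 (mod 2n)
omega = pow(3, (q - 1) // n, q)     # 3 is a primitive root mod 257, so omega has order n

# NTT matrix: W[i, j] = omega^(i*j) mod q
W = np.array([[pow(omega, i * j, q) for j in range(n)] for i in range(n)],
             dtype=np.int64)
a = np.arange(n, dtype=np.int64)
ntt_a = (W @ a) % q                 # the forward transform is one matrix-vector product

# The inverse transform is also a matrix-vector product and recovers the input.
omega_inv, n_inv = pow(omega, q - 2, q), pow(n, q - 2, q)
W_inv = np.array([[pow(omega_inv, i * j, q) for j in range(n)] for i in range(n)],
                 dtype=np.int64)
assert np.array_equal((n_inv * (W_inv @ ntt_a)) % q, a)
```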
Weaknesses
Despite the clever premise, the paper's conclusions are undermined by an incomplete analysis, questionable evaluation methodologies, and a failure to address critical, well-known challenges in GPU-based acceleration.
- Fundamentally Flawed Baseline Comparison: The headline performance claims are invalid because the comparison to the CPU baseline (Microsoft SEAL) is not equitable. The reported speedups are largely due to the massive difference in raw compute power and memory bandwidth between a high-end A100 GPU and a CPU. A rigorous evaluation would require comparing Neo not just to a CPU library, but to a state-of-the-art, highly-optimized GPU implementation that uses standard CUDA cores for the same FHE operations. Without this direct GPU-to-GPU comparison, it is impossible to isolate the true benefit of using Tensor Cores from the general benefit of using a GPU.
- Overhead of Data Reshaping is Ignored: The proposed "limb-interleaving" strategy (Section 4.1, Page 5) requires significant data reshaping and pre-processing to transform the polynomial coefficients into the specific matrix layout required by the Tensor Cores. This shuffling and re-ordering of data in GPU memory is not a "free" operation; it consumes memory bandwidth and execution cycles. The paper's performance model appears to completely ignore this overhead, which could be substantial and could significantly erode the claimed performance benefits. The analysis focuses only on the matrix multiplication itself, which is a classic flaw in accelerator research.
- Insufficient Analysis of Precision and Noise: The paper states that it uses a 64-bit representation (FP64) for its intermediate calculations to "guarantee the correctness" (Section 4.2, Page 6). This is a hand-wavy and insufficient analysis. FHE computations are notoriously sensitive to noise growth, and simply using a standard floating-point format does not automatically guarantee correctness. The paper lacks a rigorous error analysis that tracks the noise propagation through the proposed decomposition and mapping process. It is not proven that the results are cryptographically sound, only that they are numerically approximate.
- Scalability Claims are Unsubstantiated: The paper proposes a "fragment-based" scheduling method to handle large polynomials (Section 5, Page 7) but provides insufficient evidence of its efficiency. The evaluation is limited to a specific set of FHE parameters (Table 4, Page 10). It is unclear how the performance scales as the polynomial degree N and the ciphertext modulus Q grow to the very large values required for deep, complex FHE applications. The overhead of managing and scheduling a large number of fragments could easily overwhelm the benefits of using the Tensor Cores.
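For context on the fragment-count concern, a minimal tiling sketch is shown below. The 16x16 tile shape is an assumption on my part (a common Tensor Core fragment size), and the loop is ordinary NumPy rather than anything Neo-specific; it simply shows that the number of dispatched fragments grows cubically with the matrix dimension, which is the scheduling overhead the authors are asked to quantify.

```python
# Toy fragment-style tiling: a large matrix product decomposed into fixed-size
# tiles matching an assumed 16x16x16 matrix-engine shape.
import numpy as np

TILE = 16

def tiled_matmul(A, B):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2 and n % TILE == 0 and k % TILE == 0 and m % TILE == 0
    C = np.zeros((n, m), dtype=A.dtype)
    fragments = 0
    for i in range(0, n, TILE):
        for j in range(0, m, TILE):
            for p in range(0, k, TILE):
                # One "fragment" multiply-accumulate per (i, j, p) tile triple.
                C[i:i+TILE, j:j+TILE] += A[i:i+TILE, p:p+TILE] @ B[p:p+TILE, j:j+TILE]
                fragments += 1
    return C, fragments

A, B = np.random.rand(64, 64), np.random.rand(64, 64)
C, frags = tiled_matmul(A, B)
assert np.allclose(C, A @ B)
print("fragments dispatched:", frags)   # (64/16)^3 = 64; grows cubically with size
```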
Questions to Address In Rebuttal
- To provide a sound comparison, please evaluate Neo against a state-of-the-art GPU implementation of the same FHE scheme that uses standard CUDA cores and is optimized for the same A100 hardware. This is the only way to prove the specific benefit of using Tensor Cores.
- Please provide a detailed performance breakdown that includes the overhead of the "limb-interleaving" data-reshaping step. What is the latency and memory bandwidth consumption of this pre-processing step, and how does it impact the end-to-end performance of a complete FHE operation?
- Provide a rigorous cryptographic noise analysis. Show how the noise in the ciphertext propagates through your proposed decomposition, matrix multiplication, and reconstruction process, and prove that the final result remains below the failure threshold for all evaluated parameter sets.
- To substantiate your scalability claims, please provide a detailed performance model and evaluation for much larger FHE parameter sets, demonstrating how the fragment scheduling overhead and memory pressure scale as the polynomial degree and modulus size increase significantly.
Review 2
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces Neo, a novel framework for accelerating Fully Homomorphic Encryption (FHE) by leveraging the specialized Tensor Core units found in modern NVIDIA GPUs. The central idea is to bridge the gap between the world of large-integer polynomial arithmetic, which underpins FHE, and the world of high-performance, low-precision matrix multiplication, which is the native domain of Tensor Cores. Neo achieves this by presenting a clever algorithmic mapping: it decomposes the large polynomial multiplications into a series of smaller matrix-matrix multiplications that can be executed directly on the Tensor Core hardware. To manage the data flow and precision requirements, the authors propose a "limb-interleaving" data layout and a "fragment-based" scheduling strategy. This work effectively opens up a new pathway for FHE acceleration by re-purposing existing, powerful, and widely available hardware that was originally designed for AI workloads.
Strengths
This paper's primary strength is its elegant and pragmatic approach to a difficult problem, demonstrating a keen understanding of both the FHE domain and the underlying hardware landscape.
- A Brilliant Repurposing of Existing Hardware: The most significant contribution of this work is its clever recognition that the Tensor Cores, while designed for AI, are fundamentally high-performance matrix engines that can be repurposed for other domains. Instead of proposing a bespoke ASIC, which is costly and has a long design cycle, this paper provides a software-based solution that unlocks the immense computational power of existing, off-the-shelf hardware for FHE (Section 1, Page 2). This is a powerful and practical approach that has the potential for immediate, widespread impact. 💡
- Connecting Disparate Computational Domains: This work serves as a beautiful bridge between two traditionally separate fields: the high-level, abstract mathematics of lattice-based cryptography and the low-level, nitty-gritty details of GPU microarchitecture. By demonstrating how the Number Theoretic Transform (NTT), a cornerstone of FHE, can be algorithmically reframed as a series of matrix multiplications (Section 3, Pages 3-4), the paper provides a crucial Rosetta Stone that allows the FHE community to tap into the billions of dollars of R&D that have been invested in AI hardware.
- Enabling a New Class of GPU-Accelerated Cryptography: Prior work on GPU acceleration for FHE has largely focused on using the general-purpose CUDA cores. While effective, this approach often fails to utilize the GPU to its full potential, as the Tensor Cores sit idle. By specifically targeting the most powerful compute units on the chip, Neo paves the way for a new generation of GPU-accelerated cryptographic libraries that are far more efficient and performant. It opens up a new and fertile area of research for the high-performance computing and applied cryptography communities.
Weaknesses
While the core idea is powerful, the paper could be strengthened by broadening its focus from a proof-of-concept to a more robust, production-ready system.
- The Data Reshuffling Elephant in the Room: The proposed "limb-interleaving" strategy is a clever way to format the data for the Tensor Cores. However, this data shuffling itself has a performance cost. The paper focuses on the speedup of the core computation but spends less time analyzing the overhead of the data preparation and pre-processing steps. In many GPU workloads, data movement and marshalling can become a significant bottleneck, and a more detailed analysis of this overhead would provide a more complete performance picture.
- The Software Abstraction Layer: Neo provides the low-level "how-to" for mapping FHE onto Tensor Cores. The next critical step, which is not fully explored, is the software abstraction layer. For this technique to be widely adopted, it needs to be integrated into a high-level FHE library (like Microsoft SEAL) and exposed through a clean API. A discussion of the challenges in building this compiler and runtime layer—which would need to automatically handle the decomposition, scheduling, and data layout—would be a valuable addition.
- Beyond a Single GPU: The paper successfully demonstrates the potential of Neo on a single GPU. The natural next step is to consider how this approach scales to a multi-GPU or multi-node environment. As FHE applications grow in complexity, they will inevitably require the resources of a full server or cluster. A discussion of how Neo would interact with high-speed interconnects like NVLink and how the fragment-based scheduling could be extended to a distributed setting would be very interesting.
Questions to Address In Rebuttal
- Your work brilliantly repurposes hardware designed for AI. Looking forward, do you envision future GPUs having features specifically designed to make this mapping even more efficient? For example, could a future generation of Tensor Cores have native support for modular arithmetic or more flexible data layout options?
- The "limb-interleaving" strategy is key to your approach. How does the overhead of this data pre-processing scale as the FHE parameters (and thus the polynomial sizes) grow? Is there a point where the cost of data shuffling begins to diminish the benefits of using the Tensor Cores?
- For this technique to have broad impact, it needs to be integrated into a user-friendly library. What do you see as the biggest challenges in building a compiler or runtime system that could automatically and optimally apply the Neo framework to an arbitrary FHE program? 🤔
- How does the performance of Neo change with different generations of Tensor Cores (e.g., from Volta to Ampere to Hopper)? Does the growing power and complexity of the Tensor Cores open up new opportunities or create new bottlenecks for your approach?
Review 3
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents Neo, a framework for accelerating Fully Homomorphic Encryption (FHE) on modern GPUs. The core novelty claim is a new methodology for mapping the large-integer polynomial arithmetic central to FHE onto the fixed-precision, matrix-oriented Tensor Core units. This is achieved through two primary novel techniques: 1) a "limb-interleaving" data layout strategy that transforms polynomial coefficients into a matrix format suitable for the Tensor Cores, and 2) a "fragment-based" scheduling approach that decomposes large polynomial operations into a sequence of smaller matrix multiplications that fit the hardware's native dimensions. The work claims this is the first implementation to accelerate the critical BConv and IP FHE kernels on Tensor Cores, offering a new pathway for FHE acceleration.
Strengths
From a novelty standpoint, this paper's strength lies in its clever and non-obvious algorithmic mapping, which bridges two seemingly incompatible computational domains.
- Novel Algorithmic Transformation: The most significant "delta" of this work is the algorithmic transformation that reframes large-integer polynomial multiplication as a series of small, fixed-precision matrix multiplications. While using GPUs for FHE is not new, prior work has focused on using the general-purpose CUDA cores. This paper is the first to devise a concrete method for mapping these operations onto the highly specialized and architecturally rigid Tensor Cores. This conceptual leap, seeing a path from polynomial math to matrix math, is a significant and novel contribution. It unlocks a powerful, previously untapped computational resource for the FHE domain. 🧠
- Novel Data Layout and Scheduling: The "limb-interleaving" and "fragment-based" techniques are direct and novel solutions to the two primary challenges of this mapping: precision and dimensionality. Standard FHE requires high-precision arithmetic, while Tensor Cores are low-precision. Standard FHE polynomials are very large, while Tensor Cores operate on small, fixed-size matrices. The proposed data layout and scheduling schemes are the novel "glue" that makes this mapping possible. They represent a new, domain-specific approach to data management for GPU-based computation.
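A hedged sketch of the precision side of this mapping: a wide integer is split into low-precision limbs, the limb-by-limb partial products form a small matrix (here just an outer product, the matmul-friendly step), and the wide result is recovered by shifting and summing. The 8-bit limb width and the data layout below are toy choices of mine and do not reproduce the paper's limb-interleaving scheme or its modular reduction.

```python
# Toy limb-decomposition sketch (assumed limb width, no modular reduction).
import numpy as np

BASE_BITS = 8          # toy limb width; Tensor Core inputs would be 8/16-bit
BASE = 1 << BASE_BITS

def to_limbs(x, n_limbs):
    return np.array([(x >> (BASE_BITS * i)) & (BASE - 1) for i in range(n_limbs)],
                    dtype=np.int64)

def limb_product(x, y, n_limbs):
    a, b = to_limbs(x, n_limbs), to_limbs(y, n_limbs)
    partial = np.outer(a, b)                      # all limb-pair products at once
    result = 0
    for i in range(n_limbs):
        for j in range(n_limbs):
            result += int(partial[i, j]) << (BASE_BITS * (i + j))
    return result

x, y = 0x1234_5678, 0x9ABC_DEF0
assert limb_product(x, y, n_limbs=4) == x * y
```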
Weaknesses
While the core mapping is novel, the work's claims of novelty do not extend to the underlying hardware or the performance gains themselves, which are a predictable consequence of the mapping.
- Performance Gain is a Consequence, Not a Novelty: The paper reports significant speedups over CPU and other GPU baselines. However, the novelty of the work is the enabling of the Tensor Cores, not the speedup itself. It is not a novel discovery that a highly specialized, massively parallel matrix engine (the Tensor Core) is faster than a general-purpose CPU or even general-purpose CUDA cores for matrix-heavy workloads. The performance gain is an expected and logical consequence of the novel mapping, not a separate innovation.
- Relies Entirely on Existing Hardware: The novelty of this work is purely in the software and algorithmic domain. It proposes no new hardware and does not suggest any modifications to the existing GPU architecture. Its contribution is in finding a new and creative way to use hardware that already exists, which is valuable but limits the scope of the novelty to the mapping technique itself.
- Algorithmic Ancestry: While the specific application to FHE is new, the general concept of using matrix multiplication engines to perform other mathematical operations (like convolutions) has a long history in the field of high-performance computing. The novelty here is the specific, non-trivial adaptation of these principles to the unique mathematical structures of lattice-based cryptography, particularly the Number Theoretic Transform (NTT).
Questions to Address In Rebuttal
- The core of your novelty is the algorithmic mapping of polynomials to matrices. Can you discuss any prior art in other domains (e.g., signal processing, scientific computing) that has used a similar "transform-and-dispatch" approach to map non-matrix problems onto specialized matrix hardware?
- [cite_start]The "limb-interleaving" technique is presented as a novel data layout[cite: 203, 206]. How does this differ fundamentally from standard data-marshaling techniques used in high-performance libraries (e.g., cuFFT, cuBLAS) to prepare data for optimal hardware access?
- [cite_start]Could the proposed fragment-based scheduling [cite: 284] be considered a domain-specific instance of a more general tiling or loop-nest optimization strategy? What is the core, novel insight in your scheduler that is unique to the FHE domain?
- If NVIDIA were to introduce a future Tensor Core with native support for modular arithmetic or larger integer types, how much of the novelty of the Neo framework would remain? Does the contribution lie primarily in overcoming the limitations of current hardware, or is there a more fundamental, hardware-independent algorithmic novelty?
Heliostat: Harnessing Ray Tracing Accelerators for Page Table Walks
Abstract
This paper introduces Heliostat, which enhances page translation bandwidth on GPUs by harnessing underutilized ray tracing accelerators (RTAs). While most existing studies focused on better utilizing the provided translation bandwidth, this paper ...
Reviews
Review 1
Reviewer: The Guardian
Summary
This paper identifies the GPU's page translation bandwidth as a key performance bottleneck. To address this, the authors propose Heliostat, a system that offloads page table walks to the dedicated Ray Tracing Accelerators (RTAs) now common on modern GPUs. The core thesis is that the tree traversal performed during a page table walk is architecturally similar to the Bounding Volume Hierarchy (BVH) traversal performed by an RTA during ray tracing. By re-purposing this underutilized hardware, the authors claim to fundamentally increase translation bandwidth. A further optimization, Heliostat+, introduces a prefetching mechanism to proactively resolve future address translations. The authors claim this approach yields significant speedups (up to 1.93x) over baseline GPU memory management units.
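For readers less familiar with the structure being offloaded, a minimal sketch of a radix page table walk follows. The parameters are my assumptions (x86-64-style: four levels, nine index bits per level); the claimed similarity is that this descend-one-child-per-level traversal has the same shape as BVH traversal. Heliostat's ray encoding and RTA internals are not modeled here.

```python
# Minimal radix page-table walk sketch (assumed 4-level, 9-bit-per-level layout).
PAGE_SHIFT, LEVEL_BITS, LEVELS = 12, 9, 4

def make_node():
    return [None] * (1 << LEVEL_BITS)

def map_page(root, vaddr, pfn):
    """Install a translation, allocating intermediate nodes on demand."""
    node = root
    for level in range(LEVELS - 1, 0, -1):
        idx = (vaddr >> (PAGE_SHIFT + LEVEL_BITS * level)) & ((1 << LEVEL_BITS) - 1)
        if node[idx] is None:
            node[idx] = make_node()
        node = node[idx]
    node[(vaddr >> PAGE_SHIFT) & ((1 << LEVEL_BITS) - 1)] = pfn

def walk(root, vaddr):
    """The pointer-chasing walk: one dependent lookup per level."""
    node = root
    for level in range(LEVELS - 1, -1, -1):
        idx = (vaddr >> (PAGE_SHIFT + LEVEL_BITS * level)) & ((1 << LEVEL_BITS) - 1)
        node = node[idx]
        if node is None:
            return None              # unmapped: triggers a fault
    return node                      # physical frame number

root = make_node()
map_page(root, 0x7f12_3456_7000, pfn=0xABCDE)
assert walk(root, 0x7f12_3456_7123 & ~0xFFF) == 0xABCDE
assert walk(root, 0x1000) is None
```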
Strengths
The paper is founded on a single, clever observation that serves as its primary motivation.
- Identification of an Underutilized Resource: The core strength of the paper is its correct identification of a powerful and often-idle hardware unit—the RTA—and the subsequent attempt to find a general-purpose use for it. Recognizing that specialized accelerators may be repurposed is a valid direction for architectural research.
Weaknesses
Despite the clever initial idea, the paper's central thesis is built upon a fundamentally flawed analogy, questionable evaluation methodologies, and a significant underestimation of critical system overheads.
- Fundamentally Flawed Architectural Analogy: The entire premise of Heliostat rests on the claimed "operational similarities" between page table walks and ray tracing (Abstract, Page 1). This analogy is superficial at best and deeply flawed in practice. A page table walk is a simple, deterministic pointer-chasing operation through a radix tree. In contrast, ray tracing BVH traversal is a complex geometric operation involving ray-box intersection tests, complex state management, and traversal decisions based on spatial properties. Using a highly specialized, complex engine designed for geometric tests to perform simple pointer lookups is a gross architectural mismatch. It is the equivalent of using a sledgehammer to crack a nut; while it might work, it is fundamentally inefficient and wasteful.
- Unsubstantiated Performance Claims due to Inequitable Baseline: The headline performance claims (e.g., 1.93x speedup) are invalid because the comparison is not equitable. The paper compares its proposed system (GMMU + repurposed RTA) against a baseline with only the GMMU. An RTA is a large, power-hungry piece of silicon. A rigorous and fair comparison would evaluate Heliostat against a baseline where the GMMU is given an equivalent area and power budget to the RTA being repurposed. It is highly probable that a larger, more parallel, conventionally designed Page Table Walker unit would outperform the complex and inefficient Heliostat mechanism, rendering the claimed speedups an artifact of an unfair resource comparison.
- Critical Overheads are Ignored: The paper fails to properly account for the significant overhead required to make this mechanism function. A page table walk request (a virtual address) must be "encoded" into a format that the RTA can understand—a "ray" with an origin and direction (Section 4.1, Page 4). Conversely, the RTA's output (hit information) must be decoded back into a physical address. This "translation tax" of encoding and decoding is not a free operation; it consumes cycles and energy. The performance analysis appears to completely ignore or minimize this overhead, which would likely negate a significant portion of the claimed benefits in a real implementation.
- Unrealistic Prefetching Assumptions: The Heliostat+ extension relies on a "highly accurate" address prefetcher (Section 6, Page 7). However, the paper provides no evidence to support this claim of high accuracy. General-purpose, highly accurate address prediction is a notoriously difficult, and largely unsolved, problem in computer architecture. The evaluation appears to use benchmarks with regular, strided memory access patterns where prefetching is known to work well. The claim that this approach is beneficial for "any workloads" (Abstract, Page 1) is unsubstantiated, as it would likely perform poorly on irregular, pointer-chasing workloads where prefetching is ineffective.
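To ground the last point, the sketch below shows a minimal stride detector of the kind this criticism targets; it is my own illustration, not the actual Heliostat+ predictor from Section 6.4. It only issues predictions when recent page deltas agree and stays silent on irregular streams, which is exactly where the claimed benefit is in doubt.

```python
# Toy stride detector (assumed history depth and prefetch degree).
from collections import deque

class StridePredictor:
    def __init__(self, history=3, degree=2):
        self.history = deque(maxlen=history)
        self.degree = degree                 # how many pages to prefetch ahead

    def observe(self, page):
        self.history.append(page)

    def predict(self):
        if len(self.history) < self.history.maxlen:
            return []
        deltas = [b - a for a, b in zip(self.history, list(self.history)[1:])]
        if len(set(deltas)) != 1 or deltas[0] == 0:
            return []                        # irregular stream: no confident prediction
        stride, last = deltas[0], self.history[-1]
        return [last + stride * i for i in range(1, self.degree + 1)]

p = StridePredictor()
for page in (100, 102, 104):                 # regular, stride-2 stream
    p.observe(page)
assert p.predict() == [106, 108]

q = StridePredictor()
for page in (100, 137, 61):                  # irregular stream: predictor stays quiet
    q.observe(page)
assert q.predict() == []
```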
Questions to Address In Rebuttal
- Please provide a detailed breakdown of the cycle and energy overhead for the encoding/decoding process that translates a virtual address into a ray and a hit result back into a physical address. How does this "translation tax" impact the end-to-end latency of a single page walk?
- To provide a fair comparison, please evaluate Heliostat against an improved baseline that features a conventional GMMU/PTW unit designed with the same silicon area and power budget as the RTA you are repurposing.
- The Heliostat+ mechanism's performance is contingent on the accuracy of the address prefetcher. Please provide a sensitivity analysis showing how the performance of Heliostat+ degrades as prefetcher accuracy decreases, and justify your claim that this approach is effective for irregular, hard-to-predict workloads.
- Can you justify the fundamental architectural choice of using a complex geometric intersection engine for a simple pointer-chasing task? What is the raw, cycle-for-cycle efficiency (in terms of lookups-per-second-per-mm²) of your proposed mechanism compared to a standard, dedicated Page Table Walker hardware implementation?
Review 2
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces Heliostat, a novel system for accelerating GPU page table walks by repurposing the dedicated Ray Tracing Accelerators (RTAs) that are now standard on modern GPUs. The core insight is that the tree traversal inherent in a page table walk is operationally similar to the Bounding Volume Hierarchy (BVH) traversal performed by RTAs during ray tracing. By creating a lightweight hardware and software layer to map page table walk requests onto the RTA's traversal engine, Heliostat aims to significantly boost the GPU's overall address translation bandwidth. A further proposed enhancement, Heliostat+, leverages the RTA's secondary ray generation feature to enable proactive, low-cost prefetching of future address translations. This work opens up a new avenue for system performance improvement by finding a general-purpose use for a highly specialized, and often underutilized, piece of hardware.
Strengths
This paper's primary strength is its creativity and its "out-of-the-box" thinking, which connects two seemingly unrelated parts of the GPU architecture to create a new and unexpected synergy.
- A Brilliant Repurposing of Specialized Hardware: The most significant contribution of this work is its clever and non-obvious idea to use a ray tracing engine for memory management (Abstract, Page 1). Modern GPUs are increasingly becoming collections of specialized accelerators (e.g., Tensor Cores, RTAs). A major challenge in computer architecture is preventing these specialized units from becoming "dark silicon"—powered off and unused when their specific task is not running. Heliostat provides a compelling answer to this problem by finding a "secondary purpose" for the RTA, effectively democratizing a specialized unit for a general-purpose system task. This is a powerful and important direction for heterogeneous computing. 💡
- Connecting Disparate Architectural Concepts: This work serves as an intellectual bridge between the worlds of computer graphics and computer architecture. It recognizes a deep structural similarity between two different problems: finding a ray's intersection in a 3D scene and finding a virtual address's mapping in a page table. By showing how the BVH traversal performed by an RTA can be conceptually mapped to a radix tree walk (Section 2.2, Page 2), the paper demonstrates a kind of architectural isomorphism that is both insightful and inspiring.
- A New Pathway for Performance Scaling: For years, the primary approach to improving GPU memory management has been to build bigger TLBs or more parallel Page Table Walkers (PTWs) (Section 1, Page 1). Heliostat offers a completely new and orthogonal path to performance scaling. Instead of building more dedicated hardware, it leverages existing hardware more intelligently. This is a much more area- and power-efficient approach, as demonstrated by the paper's analysis (Section 8.5, Page 13), and it opens up a new dimension in the design space for future memory management units.
Weaknesses
While the core idea is brilliant, the paper could be strengthened by broadening its focus to the wider system implications and the long-term evolution of this concept.
- The "Generalization" Challenge: The paper successfully demonstrates the mapping for page table walks. The natural next question is: what other general-purpose tree or graph traversal problems could be offloaded to the RTA? A discussion of how the Heliostat framework could be generalized to accelerate other important workloads, such as database index lookups, file system traversals, or even certain types of AI model inference, would elevate the work from a clever trick to a truly general-purpose platform.
- The Software Ecosystem: The paper focuses on the hardware implementation. However, for Heliostat to be truly useful, it would need to be seamlessly integrated into the GPU driver and the broader OS memory management system. A discussion of the required software changes—for example, in the CUDA/ROCm runtime, the OS kernel's memory manager, and the compiler—would provide a more complete picture of the path to real-world deployment.
- The Co-evolution of Hardware: Heliostat is a clever solution for today's GPUs. But what about tomorrow's? If this idea were to be adopted, it might influence the design of future RTAs. Future RTAs might be designed with more general-purpose traversal features from the ground up, making them even more powerful for non-graphics tasks. A discussion of how this work could influence the future evolution of GPU architecture would be a fascinating addition.
Questions to Address In Rebuttal
- Your work brilliantly repurposes the RTA for a system-level task. Looking forward, what other common operating system or database algorithms (e.g., B-tree searches, file system lookups) do you think could be accelerated using the Heliostat framework?
- The Heliostat+ prefetcher is based on simple stride detection (Section 6.4, Page 8). How could this be improved by leveraging more advanced prefetching techniques from the CPU world, and how would the RTA architecture need to evolve to support them?
- For Heliostat to be practical, the OS and GPU driver need to be aware of it. What are the key modifications required in the software stack to manage the RTA as a general-purpose translation resource, and how would you handle security and isolation between different processes using it? 🤔
- If you were designing the next generation of GPUs from scratch, knowing about the potential of RTA offloading, would you still design a separate GMMU and RTA? Or would you merge them into a single, more powerful, and general-purpose "Traversal Engine" capable of handling both graphics and memory management tasks?
Review 3
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents Heliostat, a system for accelerating GPU page table walks. The core novelty claim is the repurposing of dedicated Ray Tracing Accelerators (RTAs), now common on GPUs, to perform this critical memory management task. The central thesis is that the tree traversal operation at the heart of a page table walk is architecturally analogous to the Bounding Volume Hierarchy (BVH) traversal performed by an RTA. The authors propose a lightweight hardware and software layer to "encode" page table walk requests as ray tracing queries and offload them to the RTA. A second claimed novelty is Heliostat+, an extension that uses the RTA's secondary ray capabilities to implement a low-cost, proactive address prefetcher.
Strengths
The novelty of this work lies in its creative and non-obvious connection between two fundamentally different domains of computer architecture, leading to a new and unexpected use for specialized hardware.
- A Novel Architectural Analogy: The most significant "delta" in this paper is the conceptual leap required to see the operational similarity between a memory management task (page table walking) and a graphics task (ray tracing) (Section 2.2, Page 2). While both involve tree traversals, they operate in entirely different semantic domains. Recognizing that the underlying hardware mechanism for one could be adapted for the other is a genuine and significant innovative insight. It challenges the conventional wisdom of designing dedicated hardware for every task and instead proposes a more resourceful approach. 💡
- A New Mechanism for General-Purpose Acceleration: Prior work on hardware acceleration has almost exclusively focused on either building new bespoke ASICs or using programmable compute units (like CUDA cores). Heliostat proposes a third, novel path: repurposing a fixed-function accelerator for a task outside its original domain. This is a fundamentally new approach to heterogeneous computing. The hardware and software mechanisms designed to "trick" the RTA into performing page walks (Section 4, Page 4) represent a new class of architectural adaptation that has not been explored in prior literature.
- Novel Application of Prefetching: The Heliostat+ extension is a clever and novel application of a graphics-specific feature for a general-purpose performance enhancement. Using the RTA's ability to spawn secondary rays to implement a low-cost, parallel prefetcher (Section 6, Page 7) is a non-obvious and elegant idea. It leverages a hardware capability that would otherwise be idle during non-graphics workloads to solve a classic and difficult problem in memory management.
Weaknesses
While the core idea is highly novel, its novelty is also its primary weakness. The work proposes a new use for existing hardware, but does not propose new fundamental hardware primitives.
- Novelty is Purely in Abstraction and Mapping: The work does not propose any changes to the RTA or the core GPU architecture. Its novelty is entirely in the software and lightweight hardware layers that translate the page table walk problem into a ray tracing problem. This is a significant contribution, but it is a "trick" or a new mapping, not a new piece of fundamental hardware.
- The Analogy is Imperfect: The core analogy, while clever, is not perfect. Ray tracing BVH traversal involves complex geometric calculations (ray-box intersection tests) that are entirely superfluous for a page table walk. The novelty lies in the ability to make this imperfect analogy work, but it also means the mechanism is not as efficient as a purpose-built Page Table Walker would be on a transistor-for-transistor basis. The novelty is in the "hack," not in the raw efficiency of the final implementation.
- Performance is a Consequence, Not an Invention: The reported speedups are a direct and expected consequence of successfully offloading work to a powerful, parallel hardware unit that was previously idle. The novelty is in enabling this offload, not in the discovery that parallelizing a task makes it faster. The performance gains validate the novelty of the approach but are not, in themselves, a separate novel contribution.
Questions to Address In Rebuttal
- Your work proposes a novel mapping of one problem (page walks) onto the hardware for another (ray tracing). Can you discuss any prior art in the broader history of computing where a fixed-function accelerator has been successfully and non-obviously repurposed for a task completely outside its original design domain?
- The Heliostat+ prefetcher is a clever use of secondary rays. Is there any prior work in the graphics domain that has used this secondary ray feature for non-obvious, non-graphics-related prefetching or speculative computation?
- The core of your novelty is the architectural analogy. If a future GPU were to include a more general-purpose "Tree Traversal Engine" instead of a specialized RTA, how much of the novelty of the Heliostat framework would remain? Does the contribution lie primarily in overcoming the limitations of today's specialized hardware?
- The "encoding" of virtual addresses into rays is a key enabling technique (Section 4.1, Page 4). Can you elaborate on the novelty of this encoding scheme itself? Are there precedents for representing non-geometric data in a geometric format to leverage graphics hardware for general-purpose computation?
Forest: Access-aware GPU UVM Management
Abstract
With GPU unified virtual memory (UVM), CPU and GPU can share a flat virtual address space. UVM enables the GPUs to utilize the larger CPU system memory as an expanded memory space. However, UVM’s on-demand page migration is accompanied by expensive page ...
Reviews
Review 1
Paper Title: Forest: Access-aware GPU UVM Management
Reviewer: The Guardian
Summary
The authors identify the performance limitations of the conventional Tree-Based Neighboring Prefetcher (TBNp) used in GPU Unified Virtual Memory (UVM) systems. They argue that TBNp's homogeneous, "one-size-fits-all" configuration is oblivious to the diverse memory access patterns of individual data objects, leading to unnecessary migrations and page thrashing. To address this, they propose Forest, a software-hardware co-design that introduces an on-device Access Time Tracker (ATT) to monitor page access sequences per data object. This information is then used by a driver-level Access Pattern Detector (APD) to classify patterns and configure a heterogeneous, object-specific prefetch tree. The paper claims significant performance improvements over the baseline TBNp and other state-of-the-art solutions, evaluated via simulation.
Strengths
-
Problem Motivation: The paper correctly identifies a valid and important limitation in existing UVM management systems. The core premise—that a single prefetcher configuration is suboptimal for diverse workloads and data structures—is sound. The analysis in Section 3, particularly Figure 4, provides clear evidence that different tree configurations benefit different applications, effectively motivating the need for a more adaptive approach.
-
Core Observation: The observation that UVM management should be performed at the granularity of individual data objects rather than fixed-size memory blocks or entire applications is insightful. This allows for a more tailored and potentially more efficient prefetching strategy.
Weaknesses
My primary concerns with this work center on the apparent arbitrariness of the core mechanism, the nontrivial and insufficiently justified hardware modifications, and the questionable fidelity of the simulation-based evaluation, which casts doubt on the extraordinary performance claims.
-
Arbitrary and Brittle Pattern Classification: The entire mechanism hinges on classifying data objects into one of four patterns (LS, HCHI, HCLI, LC), as defined in Section 4.3.2 (page 6). The thresholds used for this classification (e.g., R² > 0.8 for LS, access coverage P=0.6, access intensity A=0.4) are presented as fixed constants without any theoretical or empirical justification. This gives the impression that these are "magic numbers" tuned specifically for the evaluated benchmarks. The paper's own sensitivity analysis in Figure 19 (page 13) confirms this brittleness: performance degrades sharply if these exact thresholds are not used. This suggests the system is not robust and may perform poorly on workloads that do not fit neatly into these rigid, pre-defined boxes. The classification scheme itself feels overly simplistic for the complexity of real-world GPU access patterns; a toy reconstruction of this classification logic, as I read it, appears after this list.
-
Understated Hardware Cost and System Impact: The proposed Access Time Tracker (ATT) is presented as a minor modification that "repurposes the existing hardware page access counters" (Section 4.2, page 5). This is a misleading characterization. The paper states that existing counters reflect access frequency, while Forest requires them to store access recency (i.e., an ordered timer value). This is not a "repurposing"; it is a fundamental change to a core hardware monitoring feature. The authors fail to discuss the system-wide implications of this change. Do other system services, such as OS-level memory management, thermal throttling, or performance counters, rely on frequency data? If so, this design would break them. Furthermore, the hardware overhead in Section 7.8 is minimized as a "147-byte per-kernel" table, but the paper itself notes support for up to 128 concurrent kernels (page 6), which implies a total hardware cost of over 18KB. This is not a negligible hardware addition and its area and power costs are not analyzed.
-
Questionable Evaluation Fidelity and Baselines: The evaluation is conducted exclusively in GPGPU-Sim. While simulation is a standard tool, the reported speedups of 1.86x over the baseline TBNp are exceptionally high for a prefetching optimization. This raises serious questions about the fidelity of the baseline TBNp implementation. Production UVM drivers from vendors like NVIDIA are highly complex and aggressively optimized. It is highly probable that the simulated baseline is a simplified, less-performant version, which would artificially inflate the benefits of Forest. The paper provides no validation of its baseline against real hardware behavior, making it impossible to trust the magnitude of the claimed improvements. Without such validation, the results remain speculative.
-
Unrealistic Compiler and API Modifications: The "SpecForest" extension (Section 5) relies on static compiler analysis and, critically, proposes modifying the cudaMallocManaged API to pass hints from the compiler to the driver. Changing a fundamental, widely-used API in the CUDA ecosystem is a massive undertaking with significant backward-compatibility and software engineering implications. The paper glosses over this entirely. Moreover, the proposed static analysis for "similarity detection" (Section 5.3) appears fragile and would likely fail on any code with moderately complex pointer arithmetic or dynamically computed indices, limiting its real-world applicability.
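The sketch below is one concrete reading of the classification logic and the quoted thresholds criticized in the first weakness above (R² > 0.8, P = 0.6, A = 0.4). The feature definitions, the expansion of the four labels (linear streaming, high/low coverage, high/low intensity), and the assumption that intensity is normalized to [0, 1] are my interpretation, not the paper's exact formulas.

```python
# Toy reconstruction of the four-way classification, under assumed feature
# definitions; thresholds are the ones quoted in the review.
import numpy as np

def classify(access_order, num_pages, accesses_per_page,
             r2_thresh=0.8, coverage_thresh=0.6, intensity_thresh=0.4):
    """access_order: page indices in observed access order."""
    # Linear-streaming test: does access order fit a straight line?
    t = np.arange(len(access_order))
    slope, intercept = np.polyfit(t, access_order, 1)
    resid = access_order - (slope * t + intercept)
    ss_tot = np.sum((access_order - np.mean(access_order)) ** 2)
    r2 = 1.0 - np.sum(resid ** 2) / ss_tot if ss_tot > 0 else 0.0
    if r2 > r2_thresh:
        return "LS"                                   # presumed: linear streaming
    coverage = len(set(access_order)) / num_pages     # fraction of pages touched
    intensity = np.mean(accesses_per_page)            # assumed normalized re-use in [0, 1]
    if coverage >= coverage_thresh:
        return "HCHI" if intensity >= intensity_thresh else "HCLI"
    return "LC"                                       # low coverage / fallback

pages = np.arange(0, 64)                              # perfectly sequential touches
print(classify(pages, num_pages=64, accesses_per_page=np.full(64, 0.1)))  # -> LS
```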
Questions to Address In Rebuttal
-
On Pattern Classification: Please provide a rigorous justification for the specific classification thresholds chosen (R² > 0.8, P=0.6, A=0.4). How were these values derived? Given the performance sensitivity shown in Figure 19, please defend the claim that this mechanism is robust enough for general-purpose use and not simply overfitted to your benchmark suite. What is the performance impact of a misclassification?
-
On Hardware Modification: Please address the system-level impact of changing the fundamental behavior of hardware page access counters from tracking frequency to recency. Acknowledge and discuss which, if any, existing system functionalities would be compromised by this change. Provide a more realistic hardware cost analysis (area, power) for the full 18KB multi-kernel object table.
-
On Evaluation Fidelity: Please provide evidence to substantiate that your simulated TBNp baseline is a faithful representation of a production-quality UVM prefetcher. Can you present any data, even for a simple microbenchmark, that validates your simulator's UVM faulting behavior and performance against a real GPU? Without this, why should the reviewers trust the reported 1.86x speedup?
-
On API Changes: Please discuss the software engineering challenges and ecosystem-wide impact of modifying a core API like cudaMallocManaged. Is this a practical suggestion, and what would be the path to adoption?
Review 2
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper identifies a fundamental, yet previously unaddressed, inefficiency in the prevalent Tree-based Neighboring Prefetcher (TBNp) used in modern GPUs for Unified Virtual Memory (UVM) management. The authors compellingly argue that TBNp's "one-size-fits-all" homogeneous tree structure is oblivious to the diverse memory access patterns of different data objects, leading to suboptimal performance, unnecessary migrations, and page thrashing.
The core contribution is Forest, a novel access-aware UVM management system that dynamically configures a bespoke, heterogeneous prefetch tree for each individual data object at runtime. This is achieved through an elegant software-hardware co-design. A lightweight hardware unit, the Access Time Tracker (ATT), repurposes existing page access counters to record access recency rather than just frequency. This data is then consumed by a driver-level Access Pattern Detector (APD) that classifies object access patterns into one of four archetypes and configures the optimal prefetch tree structure accordingly. The paper further proposes Speculative Forest (SpecForest), which uses compile-time analysis and pattern recording to reduce or eliminate the runtime profiling overhead. The experimental results demonstrate significant speedups over both the baseline TBNp and other state-of-the-art solutions.
Strengths
-
Tackles a Foundational Problem: The most significant strength of this work is its insight. Rather than incrementally improving UVM performance by tuning migration thresholds or eviction policies (as much prior work has done), this paper questions the foundational assumption of a homogeneous prefetch architecture. By shifting the granularity of policy from the application-level to the object-level, Forest addresses what appears to be the root cause of many inefficiencies described in the motivation (Section 3, Pages 3-4). This is a conceptual leap that reframes the problem in a productive way.
-
Elegant and Practical Co-Design: The proposed hardware modification (the ATT) is commendably lightweight and practical. By repurposing existing page access counter infrastructure, the authors present a solution that seems plausible for integration into future hardware without a major overhaul. This pragmatism is a key feature that distinguishes it from more heavyweight academic proposals.
-
A Holistic and Layered Solution: The paper presents a complete system. Forest provides the dynamic runtime mechanism, while SpecForest provides a static/memoized optimization path to reduce overhead. The inclusion of SpecForest, with its use of pattern recording and static analysis for similarity detection (Section 5, Pages 9-10), shows a deep understanding of the practicalities of system performance, acknowledging that runtime profiling is not always the best solution.
-
Excellent Contextualization and Forward-Looking Vision: The discussion in Section 6 (Page 10) on the applicability of Forest to emerging heterogeneous architectures like Grace-Hopper is particularly insightful. The authors correctly identify that even with high-speed interconnects that enable remote access, intelligent data migration remains crucial due to bandwidth disparities. This demonstrates a panoramic view of the field and positions the work's core principles as durable and relevant for next-generation systems, not just current ones.
-
Strong and Convincing Motivation: The motivation presented in Section 3 is excellent. The data shown in Figure 4, demonstrating that no single tree configuration is optimal for all applications, and Figure 5, showing diverse patterns within a single application, provides a powerful and clear justification for the entire approach. This sets the stage perfectly for the proposed solution.
Weaknesses
While the core idea is strong, the work could be further contextualized and its boundaries explored. These are not flaws so much as opportunities for refinement.
-
Simplicity of the Pattern Taxonomy: The proposed four-pattern taxonomy (LS, HCHI, HCLI, LC) is a powerful simplification that enables the system's design. However, it's worth considering its limitations. Real-world access patterns can be phased, complex, or a hybrid of these archetypes. The paper implicitly handles this by defaulting to 'LC' for unrecognized patterns, but the performance implications of misclassification or an "unclassifiable" pattern could be explored more deeply. The taxonomy feels like a very effective first-order approximation, but the landscape of patterns is likely richer.
-
Interaction with Software-Managed Prefetching: The work is situated firmly in the context of hardware/driver-managed UVM. However, there is a parallel body of work on application-level or library-level prefetching (cudaMemPrefetchAsync). It would be interesting to understand how Forest might interact with such explicit prefetching directives. Could they conflict, or could the information from Forest's pattern detector be exposed to the programmer to guide better explicit prefetching?
-
Details of Static Analysis: The description of the static analysis for SpecForest is high-level. While the idea of detecting index similarity is intuitive and powerful (as shown in Listing 2, Page 9), the robustness of this analysis in the face of complex C++ templates, function pointers, and heavy pointer arithmetic (all common in sophisticated GPU codes) is an open question. A brief discussion of the limitations of this static analysis would strengthen the paper.
Questions to Address In Rebuttal
-
Regarding the four-pattern taxonomy: Can the authors comment on the prevalence of "unclassifiable" patterns that would default to the LC configuration? How sensitive is the system's performance if an object with a borderline pattern (e.g., between HCHI and HCLI) is misclassified?
-
The paper focuses on optimizing the prefetch tree structure. Could the access patterns identified by the APD also be used to inform a more intelligent eviction policy beyond the proposed object-level pseudo-LRU? For instance, data with a streaming (LS) pattern is inherently "dead" after access, suggesting it could be a priority candidate for eviction, irrespective of recency.
-
How does the system handle data objects that exhibit phased behavior, where the access pattern changes dramatically during a single kernel's execution? Does the profiling window (10K accesses) and cease bit mechanism risk locking in a suboptimal tree configuration based on the object's initial access pattern?
Review 3
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper introduces "Forest," a novel access-aware management system for GPU Unified Virtual Memory (UVM). The core thesis is that the conventional one-size-fits-all, homogeneous Tree-Based Neighboring Prefetcher (TBNp) is inefficient for workloads with diverse memory access patterns. Forest's primary claimed novelty is the introduction of heterogeneous, per-data-object prefetcher configurations. This is achieved through a software-hardware co-design where a hardware "Access Time Tracker" (ATT) monitors page access sequences to infer patterns, which a software "Access Pattern Detector" (APD) in the UVM driver uses to dynamically reconfigure the TBNp tree structure (size and leaf granularity) for each data object. A secondary contribution, "Speculative Forest," aims to reduce this runtime profiling overhead via pattern recording and static compile-time analysis.
Strengths
The primary strength of this paper lies in its central, novel idea. My analysis of the prior art confirms that the core contribution is a genuine advancement in the field of UVM management.
-
Fundamental Shift in Prefetcher Management: The most significant novel contribution is the move from a static, homogeneous TBNp configuration to a dynamic, heterogeneous one. Previous state-of-the-art academic work [27, 29] has focused on adaptively tuning the parameters of the existing TBNp structure (e.g., migration thresholds) or refining eviction policies [26]. This paper is the first I am aware of to propose fundamentally altering the structure of the prefetcher tree itself (e.g., its total size and the granularity of its leaf nodes) on a per-data-object basis at runtime. This is a conceptual leap from tuning a prefetcher to reconfiguring it.
-
Novel Hardware Mechanism for Pattern Detection: The repurposing of existing hardware page access counters is a clever and elegant mechanism. Instead of tracking access frequency (hotness), the proposed ATT (Section 4.2, page 5) uses them to record access recency/order within an object. This provides the fine-grained sequence information necessary for pattern detection without requiring costly new hardware monitors. This specific application of access counters for temporal sequence tracking appears to be novel.
-
Lightweight Tree Reconfiguration Primitive: The introduction of two 1-bit metadata flags (isolation and motion) per non-leaf node (Section 4.4, page 7) is a novel and lightweight hardware primitive for enacting the dynamic tree reconfiguration. It allows the software to effectively create different logical tree structures from a single physical one, which is an efficient implementation of the core idea.
-
Novel Heuristics in Speculative Forest: While using static analysis to detect linear access patterns (Section 5.2, page 9) is a well-established technique, the proposed "access pattern similarity detection" (Section 5.3, page 10) is a novel and practical heuristic. Grouping data objects based on their use of the same indexing variables at compile time to propagate a discovered pattern is a new idea in this context and cleverly reduces runtime overhead for complex access patterns.
Weaknesses
While the core idea is strong, the novelty of some constituent parts is less pronounced, and the work's novelty could be more rigorously defended against adjacent concepts.
-
Established Principles in Pattern Classification: The paper proposes a four-type access pattern taxonomy (LS, HCHI, HCLI, LC) in Section 4.3.2 (page 6). While this specific classification is tailored to the problem, the general concept of classifying memory accesses into categories like streaming/linear, strided, or irregular/random is a foundational concept in the history of prefetching research. The novelty lies in the application of this classification to reconfigure TBNp, not in the act of classification itself. The paper could be strengthened by acknowledging this and more clearly delineating where the established principle ends and their novel application begins.
-
Limited Scope of Novelty to TBNp Architecture: The solution is tightly coupled to the specifics of NVIDIA's TBNp. The discussion in Section 6 (page 10) briefly mentions how the idea could apply to AMD's range-based SVM, but the proposed hardware mechanisms (ATT, isolation/motion bits) are TBNp-specific. This raises the question of whether the novel contribution is a general principle ("access-aware, configurable prefetching") or a highly specific, albeit effective, point solution for TBNp. The paper's claim to novelty would be stronger if it better abstracted its core mechanisms from the TBNp implementation.
-
Incremental Novelty of Eviction Policy: The proposed pseudo-LRU eviction policy (Section 4.5, page 8) is a logical extension of the repurposed access counters. While an improvement over far-fault-based LRU, it is conceptually similar to other work that advocates for using hardware access information for eviction [27]. The key delta here is the use of recency over frequency and the two-level search (finding the LRU object, then the LRU page within it), which is a clever refinement but perhaps not a standalone novel contribution of the same caliber as the configurable tree.
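For clarity, here is a small sketch of the two-level search described above, using monotone access stamps as a stand-in for the repurposed hardware counters. The definition of "LRU object" (oldest most-recent access) and the data layout are assumptions made for readability, not the paper's implementation.

```python
# Illustrative two-level pseudo-LRU victim selection (assumed semantics):
#   1) pick the least-recently-used object (its most recent access is oldest),
#   2) pick the least-recently-used page inside that object.
# "Recency" is a monotonically increasing access stamp standing in for the
# repurposed hardware access counters.

from typing import Dict, Tuple

# object_id -> {page_id -> last_access_stamp}
ResidentSet = Dict[int, Dict[int, int]]


def select_victim(resident: ResidentSet) -> Tuple[int, int]:
    # Level 1: LRU object = object whose most recent access is oldest.
    lru_obj = min(resident, key=lambda obj: max(resident[obj].values()))
    # Level 2: LRU page within that object.
    lru_page = min(resident[lru_obj], key=lambda pg: resident[lru_obj][pg])
    return lru_obj, lru_page


resident: ResidentSet = {
    0: {0x0: 10, 0x1: 42},   # object 0 touched recently (stamp 42)
    1: {0x0: 7,  0x1: 9},    # object 1 idle since stamp 9 -> evict from here
}
assert select_victim(resident) == (1, 0x0)
```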
Questions to Address In Rebuttal
-
Prior work, such as Early-Adaptor [29] and AdaptiveThreshold [27], dynamically adjusts UVM prefetching aggressiveness. Please clarify precisely how your proposed structural reconfiguration of the TBNp tree is fundamentally different from their approach of tuning migration thresholds within a fixed tree structure. Why is structural reconfiguration a more powerful and novel primitive?
-
The four proposed access patterns (LS, HCHI, HCLI, LC) are central to your method. Could you elaborate on the methodology used to arrive at this specific taxonomy? Was this set derived empirically from the chosen benchmarks, or is there a more fundamental basis for it? How sensitive is Forest's performance to this exact classification?
-
The discussion on applying Forest's principles to non-tree prefetchers like AMD's range-based SVM is brief. To better establish the generality of your novel idea, could you propose a concrete (if hypothetical) hardware/software mechanism, analogous to the isolation and motion bits, that would enable dynamic reconfiguration of migration "Ranges" based on detected access patterns?
Avant-Garde: Empowering GPUs with Scaled Numeric Formats
Abstract
The escalating computational and memory demands of deep neural networks have outpaced chip density improvements, making arithmetic density a key bottleneck for GPUs. Scaled numeric formats, such as FP8 and Microscaling (MX), improve arithmetic density by ...
Reviews
Review 1
Paper Title: Avant-Garde: Empowering GPUs with Scaled Numeric Formats Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper proposes Avant-Garde, a GPU microarchitecture designed to provide native hardware support for scaled numeric formats like MX and HBFP. The authors identify the software overhead (increased instruction count and register pressure) of managing these formats on conventional GPUs as a key performance bottleneck. The core idea is to "flatten" multi-level scaled formats into a consistent, single-level internal representation. This is achieved through two primary hardware modifications: a new pipeline stage called the "Operand Transformer" to perform the flattening, and a redesigned "Avant-Garde Tensor Core" that can directly operate on these flattened formats. The authors evaluate their proposal using a modified Accel-Sim simulator and claim significant throughput improvements (up to 74%) and execution time reductions (up to 44%) over a conventional GPU baseline, with negligible accuracy degradation.
While the problem statement is valid and the high-level architectural concept is directionally sound, the manuscript in its current form suffers from a fundamentally weak experimental baseline, questionable simulation assumptions, and an oversimplified analysis of hardware overhead and numerical precision. The impressive performance claims are not sufficiently substantiated against a robust point of comparison, and critical details are omitted.
Strengths
-
Problem Motivation: The paper does an excellent job of motivating the problem. The analysis in Section 2.2 (page 4), including the PTX instruction trace (Figure 3) and the quantification of register file and instruction overhead (Figure 4), clearly illustrates the inefficiency of supporting scaled numeric formats in software on current architectures. This part of the work is a valuable contribution.
-
Architectural Concept: The core architectural idea of converting various scaled formats into a unified internal representation ("flattening") is an elegant approach. It centralizes the complexity of handling diverse formats into a single, specialized hardware unit, which is a sensible design principle.
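To ground the discussion, the sketch below flattens a hypothetical two-level scaled block (a shared power-of-two block scale plus per-subset scales, in the spirit of MX) into a single-level, BFP-like representation. The format parameters (SUBSET, MANT_BITS, ELEM_MAX), the rounding helper, and the saturation behavior are assumptions for illustration only; they are not the Avant-Garde datapath.

```python
# Illustrative "flattening" of a two-level scaled format (assumed layout):
#   value[j] = element[j] * 2**subset_scale[j // SUBSET] * 2**block_scale
# Flattening folds the fine-grained subset scales into the elements, leaving a
# single shared block scale -- i.e., a BFP-like representation a Tensor Core
# could consume directly. Re-rounding the scaled elements into a narrow element
# format is where underflow/saturation can occur (cf. the precision concerns
# raised later in this review).

import math
from typing import List, Tuple

SUBSET = 4            # elements per second-level scaling group (assumption)
MANT_BITS = 3         # element mantissa width, FP8-e4m3-like (assumption)
ELEM_MAX = 448.0      # saturation bound chosen purely for illustration


def round_to_mantissa(x: float, mant_bits: int = MANT_BITS) -> float:
    """Round x to a value representable with `mant_bits` mantissa bits."""
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(x)))
    step = 2.0 ** (e - mant_bits)
    y = round(x / step) * step
    return math.copysign(min(abs(y), ELEM_MAX), x)   # saturate, don't overflow


def flatten(elems: List[float], subset_scales: List[int], block_scale: int
            ) -> Tuple[List[float], int]:
    """Return (flattened elements, single shared scale exponent)."""
    flat = []
    for j, e in enumerate(elems):
        s = subset_scales[j // SUBSET]                # fold fine scale into element
        flat.append(round_to_mantissa(e * 2.0 ** s))  # lossy re-rounding step
    return flat, block_scale                          # one shared scale remains


elems = [1.25, -0.5, 3.0, 0.75, 1.0, 2.5, -1.5, 0.25]
flat, scale = flatten(elems, subset_scales=[1, -2], block_scale=-3)
# Reconstructed value of element j is now simply flat[j] * 2**scale.
```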
Weaknesses
-
Baseline Invalidity and Inflated Claims: The paper's primary claims of a 74% throughput improvement and 44% execution time reduction are built upon a comparison to a baseline that appears to be a strawman. The authors state (Section 4, page 9), "In the baseline, we implement a DNN model that handles the scaling factor in software to support the scaled numeric formats." This is a custom, likely unoptimized, software implementation. A dedicated hardware unit will naturally outperform a general-purpose software implementation. The critical scientific question is not if hardware is better, but by how much compared to a highly optimized, industrial-strength software library (e.g., a hypothetical cuDNN with native MX support). Without this comparison, the reported gains are likely vastly inflated and do not represent the true value proposition of the proposed hardware.
-
Superficial Hardware Overhead Analysis: The silicon overhead analysis presented in Section 3.3 (page 8) is unconvincing.
- Synthesizing on the FreePDK 45nm technology node is insufficient for projecting overhead on a modern, high-performance GPU built on a 5nm-class FinFET process. The relative cost of logic, memory, and routing is drastically different. A 1.4% area overhead on a 45nm SM is not a reliable proxy for the impact on a highly dense and complex modern SM, where routing congestion and timing closure for a new pipeline stage could present significant challenges.
- The paper claims a latency impact of "two cycles per warp" for the Operand Transformer. This is a fixed number that seems divorced from the complexity of the operation. The flattening process for a format with four scaling levels must surely take more time than one with two. This latency is not properly modeled or justified. The claim that this latency is always hidden by other warps is an optimistic assumption that will not hold in all execution scenarios.
-
Unjustified Simulation Simplifications: The methodology in Section 4 (page 8) contains a critical, unsupported assumption: "As Accel-Sim does not support FP8, we modify the simulator to compute a scaling factor so that FP8 operations execute with the same latency as INT8." This is fundamentally incorrect. FP8 and INT8 are not equivalent. FP8 arithmetic requires exponent handling (alignment, addition) and normalization, which necessitates different and potentially more complex hardware than integer multiplication. Equating their latencies is an oversimplification that biases the evaluation.
-
Undeclared Numerical Impact of Flattening: The authors claim in Section 5.5 (page 11) and Table 4 that flattening a multi-level format like MX9 results in virtually no accuracy loss (<0.2% vs FP32). This claim is not substantiated with a numerical analysis. The "flattening" process involves multiplying element values by their second-level scaling factors. This is not a lossless operation. It can easily lead to underflow (loss of precision for small values) or overflow/saturation (clipping of large values) when the result is stored in a fixed-width mantissa. The authors must provide a detailed analysis of the intermediate numerical formats and error bounds of the flattening operation itself, rather than just asserting that the final application-level accuracy is maintained. The statement "operand transformation introduces no significant loss in precision" is an unsubstantiated claim.
-
Omission of Critical Data: The sensitivity study in Section 5.6 (page 11) is a major point of concern. The authors state, "As the overall performance across configurations shows minimal variation, we omit a plot for this analysis." This is unacceptable in a rigorous scientific paper. This analysis is crucial for understanding the scalability and limitations of the proposed architecture. Hiding this data prevents reviewers from assessing the performance at corner cases (e.g., many scaling levels, very large block sizes) where the "less than 1% of total execution time" claim for flattening might break down.
Questions to Address In Rebuttal
-
Regarding the Baseline: Please justify why your custom software baseline is a fair comparison point. Can you provide any evidence or theoretical argument to suggest that an industry-optimized software library for MX formats would not significantly close the performance gap you report?
-
Regarding Hardware Cost: Can you provide a more robust analysis of the hardware overhead? Specifically, how do your overhead projections change when considering a modern process node (e.g., 7nm or 5nm)? What is the cycle latency of the Operand Transformer as a function of the number of scaling levels (L), and how was this latency determined?
-
Regarding Simulation Fidelity: Please defend the assumption that FP8 and INT8 operations have identical latency. What is the justification for this simplification, and how would the results change if a more realistic latency model for FP8 arithmetic were used?
-
Regarding Numerical Precision: Please provide a detailed numerical analysis of the flattening operation. What is the bit-width of the internal datapaths within the Operand Transformer? How do you handle potential overflow and underflow during the multiplication of elements by scaling factors to guarantee that precision is maintained, as you claim?
-
Regarding Omitted Results: Please provide the full data and plots for the sensitivity study described in Section 5.6. Specifically, show the impact on total execution time as a) the number of scaling levels is varied from 1 to 4, and b) the block size is varied from 32 to 512.
Review 2
Paper Title: Avant-Garde: Empowering GPUs with Scaled Numeric Formats Reviewer Persona: The Synthesizer (Contextual Analyst)
Summary
This paper identifies a critical and growing bottleneck in modern GPU architectures: the inefficient, software-based handling of advanced scaled numeric formats like MX and HBFP. The authors compellingly argue that while these formats are crucial for improving arithmetic density, the software overhead in managing their hierarchical scaling factors (e.g., per-tensor, per-block, per-subset) negates many of their potential performance benefits, primarily through increased instruction count and register pressure.
The core contribution is Avant-Garde, a novel microarchitectural extension for GPUs designed to natively support these diverse formats. The central idea is elegant: a hardware module called the "Operand Transformer" intercepts operands in their scaled format and "flattens" them into a canonical, single-level internal representation. This flattened format, consisting of a single shared scaling factor and a block of elements, can then be processed efficiently by a redesigned Tensor Core. This approach effectively moves the costly format conversion out of the software domain and into a dedicated, low-latency pipeline stage, thereby unifying the computation pipeline for a wide array of present and future scaled numeric formats. The authors demonstrate significant throughput (up to 74% higher) and execution time (up to 44% lower) improvements with negligible accuracy loss.
Strengths
-
Excellent Problem Formulation and Motivation: The paper does a superb job of contextualizing its contribution. The analysis in Section 2, particularly the illustration of the PTX instruction stream (Figure 3, page 4) and the quantification of register and instruction overhead (Figure 4, page 4), provides a clear and convincing motivation. The authors are not solving a contrived problem; they are addressing a tangible and increasingly relevant challenge at the intersection of DNN model design and hardware architecture.
-
Elegant and Generalizable Core Concept: The central idea of "flattening" is a powerful architectural pattern. By creating a canonical internal representation, the design avoids the trap of building bespoke hardware for every new numeric format (FP8, MX4, MX6, etc.). Instead, it provides a general mechanism that can potentially accommodate future formats that fit the scaled numeric paradigm. This abstraction layer between the programmer-visible format and the internal execution format is a hallmark of strong architectural design.
-
Timeliness and High Potential Impact: This work is situated perfectly within the current landscape of AI hardware research. As the industry, led by efforts like the Open Compute Project (OCP) for Microscaling, moves towards standardizing sub-8-bit formats, the need for efficient hardware support becomes paramount. Avant-Garde provides a well-reasoned blueprint for how major GPU vendors could integrate such support. If adopted, this approach could significantly accelerate the adoption of more aggressive quantization schemes for both training and inference, unlocking further gains in model efficiency.
-
Holistic Architectural Vision: The proposal is not just a single-trick module; it is a coherent set of microarchitectural extensions. The combination of the Operand Transformer, the redesigned Tensor Core, and the corresponding API (Section 3.2, page 7) presents a complete solution. It considers the full data path from memory to execution and back, providing a practical and seemingly implementable design.
Weaknesses
While the core idea is strong, the paper could benefit from a deeper exploration of its implications and limitations:
-
The "Unflattening" Path for Training: The paper's primary focus is on the forward (flattening) path, which is dominant in inference. The reverse path, "unflattening" updated weights back into their original scaled format during training, is discussed more briefly (page 8). The authors state this process leverages CUDA cores and has minimal impact due to its infrequency. However, this complex data-dependent transformation (requiring finding new scaling factors, etc.) could become a non-trivial overhead in training-intensive workloads or novel training algorithms. A more quantitative analysis of this reverse path would strengthen the paper's claims for training efficiency (a generic re-blocking sketch illustrating the work involved appears after this list).
-
Programmability and Extensibility of New Formats: The API presented in Figure 9 (page 7) seems to rely on predefined formats (scaled mx9;). While this is practical, it raises questions about the system's true flexibility. How would a researcher experimenting with a novel scaled format (e.g., a three-level hierarchy or non-power-of-two block sizes) utilize Avant-Garde? A more detailed discussion on the boundary between what the hardware can parametrically support and what would require new software libraries or microcode would better define the limits of the proposed architecture's "future-proofing."
-
Interaction with Orthogonal Optimizations: Modern GPUs incorporate many specialized features beyond numeric formats, most notably hardware support for sparsity. How does the flattened internal representation interact with structured or unstructured sparsity? Does it create new opportunities for compression, or does it potentially complicate the identification of zero-value blocks? Placing Avant-Garde in the context of other major architectural trends like sparsity and data compression would provide a more complete picture of its role in a future GPU.
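As a rough illustration of what "unflattening" entails, the sketch below re-blocks updated higher-precision weights into blocks with freshly derived shared scaling factors. The block size, power-of-two scales, and round-to-nearest policy are assumptions of this review; the MX specification's exact rounding rules and the paper's CUDA implementation are not reproduced here.

```python
# Illustrative "unflattening" / re-blocking after a training update (assumptions:
# power-of-two shared scales, round-to-nearest elements, block size BLOCK).
# The cost in question comes from the per-block max-reduction plus a
# requantization pass over every element.

import math
from typing import List, Tuple

BLOCK = 8          # elements sharing one scale (assumption)
MANT_BITS = 3      # element mantissa width (assumption)


def reblock(weights: List[float]) -> List[Tuple[int, List[float]]]:
    """Split updated FP32 weights into (shared_scale_exponent, elements) blocks."""
    out = []
    for i in range(0, len(weights), BLOCK):
        blk = weights[i:i + BLOCK]
        amax = max((abs(w) for w in blk if w != 0.0), default=1.0)
        shared = math.floor(math.log2(amax))          # new shared scale exponent
        step = 2.0 ** (shared - MANT_BITS)
        elems = [round(w / step) * step / 2.0 ** shared for w in blk]
        out.append((shared, elems))
    return out


updated = [0.013, -0.021, 0.007, 0.0, 0.018, -0.005, 0.011, 0.002]
blocks = reblock(updated)   # each block: shared exponent + requantized elements
```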
Questions to Address In Rebuttal
-
Could the authors provide a more detailed analysis of the overhead associated with the "unflattening" process required for training? For instance, what percentage of total training time for one epoch of a model like BERT or GPT-2 would be spent in this CUDA-based weight reorganization? Is there a scenario where this could become a bottleneck?
-
Regarding the API's extensibility: Could you clarify the mechanism by which a user could define and use a novel scaled numeric format not pre-compiled into the driver? What are the specific parameters (e.g., number of levels, block sizes, bitwidths) that the Operand Transformer hardware can handle dynamically?
-
Could you speculate on the interplay between Avant-Garde's flattening mechanism and hardware support for sparsity? For example, would sparsity checks be more efficient on the original multi-level format or on the flattened, single-level internal representation?
Review 3
Paper: Avant-Garde: Empowering GPUs with Scaled Numeric Formats Review Form: The Innovator (Novelty Specialist)
Summary
The paper proposes Avant-Garde, a GPU microarchitecture designed to natively support diverse scaled numeric formats, particularly those with multiple scaling levels like the Microscaling (MX) format. The authors identify that current GPUs rely on inefficient software-based methods to handle these formats, leading to significant instruction and register overhead. The central novel mechanism proposed is "operand flattening," where multi-level formats are converted in hardware by a new "Operand Transformer" pipeline stage into a canonical, single-level internal representation. This "flattened" format, which is functionally equivalent to a block floating-point (BFP) representation, is then processed by a modified Tensor Core. The authors claim this approach eliminates software overhead, achieving significant throughput and execution time improvements with modest hardware cost.
Strengths
The primary strength of this work lies in its novel microarchitectural approach to a well-known and increasingly important problem. While the challenge of supporting complex numeric formats is not new, the proposed solution is architecturally elegant.
-
Novel Architectural Pattern: The core idea of "flattening" multi-level formats into a canonical single-level internal representation is a clean and compelling architectural pattern. It effectively decouples the complexity of handling a diverse and evolving ecosystem of external numeric formats from the design of the core arithmetic units. This creates a stable internal interface that simplifies the computational core and allows for future format support with potentially minimal changes to the execution units.
-
Instantiation of a Hardware Conversion Stage: The novelty is not in the mathematical conversion itself, but in the specific proposal to instantiate this conversion as a dedicated hardware stage (the "Operand Transformer") within a general-purpose GPU pipeline. This is a concrete and well-defined microarchitectural contribution that directly addresses the software overhead documented in Section 2.2 (Page 4, Figure 4).
-
Justified Complexity: The authors present a clear trade-off analysis. The proposed hardware additions are reported to be modest (Section 3.3, Page 8), while the performance benefits are substantial. This indicates that the novel complexity introduced is well-justified by the gains it provides, a crucial aspect for any new architectural feature.
Weaknesses
My critique focuses on the precise boundaries of the novelty and the paper's positioning relative to existing concepts.
-
The "Flattening" Concept is a Re-framing of a Known Conversion: The underlying concept of "flattening" is, in essence, a pre-computation of scaling factors to convert a hierarchical format (like MX) into a standard Block Floating-Point (BFP) format. The paper's novelty rests entirely on the proposal to instantiate this conversion as a dedicated hardware pipeline stage, not on the invention of the conversion process itself. The manuscript could be more explicit in distinguishing its microarchitectural novelty from the underlying mathematical operation, which is not new.
-
Insufficient Comparison with Prior Accelerator Designs: The Related Work section (Section 6, Pages 11-12) discusses other accelerators for scaled formats (e.g., MSFP-based [41], MX-based [42]), but it does not sufficiently analyze their internal datapath designs. The critical question for evaluating novelty is: how did prior dedicated accelerators handle multi-level scaling? Did they also use an explicit conversion to an internal BFP-like format, or did they use a more complex, on-the-fly arithmetic unit that directly managed the hierarchy? Without this direct comparison, the uniqueness of the "Operand Transformer" as an architectural strategy is not as sharply defined as it could be. For instance, accelerators from works like [24] or [39] might contain logic that is functionally equivalent to flattening, even if not described with that term.
-
Limited Novelty of the Computational Core: The design of the "Avant-Garde Tensor Core" (Section 3.2, Page 7), which consumes the flattened format, is functionally very similar to previously proposed BFP or HBFP arithmetic units (e.g., [9], [50]). These units are designed to perform dot products on blocks of values sharing a single exponent/scaling factor. The paper should more clearly state that the novelty is concentrated in the production of this BFP-like format by the Operand Transformer, rather than its consumption by the Tensor Core. The contribution is the bridge, not the destination.
Questions to Address In Rebuttal
-
Could the authors please clarify the distinction between the mathematical operation of flattening (which is a known conversion from a hierarchical format to a BFP format) and their specific microarchitectural contribution? Is the primary novel claim the proposal of a dedicated, explicit pipeline stage for this conversion within a GPU architecture?
-
Can the authors provide a more detailed comparison of their "Operand Transformer" approach to the internal datapath designs of prior art in dedicated MX-specific accelerators [24, 39, 42]? Specifically, do these accelerators also convert to a canonical internal representation, or do they employ a different strategy (e.g., on-the-fly scaling factor application)? A direct comparison of architectural strategies would significantly strengthen the paper's novelty claim.
-
The proposed architecture maintains the flattened format in the register file and memory for potential reuse. What are the microarchitectural complexities and overheads associated with managing both the original and flattened data representations in the memory subsystem, particularly in scenarios with frequent register spills/fills or complex data reuse patterns that might require coherence between the two forms?
The XOR Cache: A Catalyst for Compression
Abstract
Modern computing systems allocate significant amounts of resources for caching, especially for the last level cache (LLC). We observe that there is untapped potential for compression by leveraging redundancy due to private caching and inclusion that are ...
Reviews
Review 1
Paper Title: The XOR Cache: A Catalyst for Compression Reviewer Persona: The Guardian (Adversarial Skeptic)
Summary
The paper introduces the "XOR Cache," a compressed Last-Level Cache (LLC) architecture that aims to exploit data redundancy stemming from inclusive cache hierarchies. The core mechanism involves storing the bitwise XOR of two cache lines in a single LLC slot, provided at least one of the original lines is present in a higher-level (private) cache. This inter-line compression is intended to not only save space but also to act as a "catalyst" for existing intra-line compression schemes by reducing the entropy of the resulting XORed line. The authors present a modified coherence protocol to manage these compressed line pairs and provide an evaluation showing significant area/power savings with what they claim is a minor performance overhead.
While the premise of leveraging inclusion redundancy is conceptually interesting, the work suffers from critical weaknesses in its evaluation of performance overhead, the justification of its complex coherence protocol, and its claims of superiority over simpler, more established baselines. The presented benefits appear to come at the cost of unquantified complexity and potentially severe performance penalties in scenarios not fully captured by the evaluation.
Strengths
- Novel Premise: The core concept of transforming the inclusion property from a source of redundancy into an opportunity for compression is clever and thought-provoking.
- Synergistic Potential: The paper provides a compelling, albeit idealized, motivation in Section 1.2 (Figure 2) that XORing similar lines can significantly boost the effectiveness of existing intra-line compression schemes.
- Comprehensive Baselines: The evaluation compares the proposed design against a reasonable set of state-of-the-art compression schemes (BAI, BPC, Thesaurus) and, importantly, an Exclusive LLC baseline, which represents a direct and simpler alternative for mitigating inclusion-based redundancy.
Weaknesses
-
Gross Underestimation of Coherence and Decompression Overhead: The central claim of a mere 2.06% average performance overhead is highly suspect and appears inconsistent with the paper's own description of the protocol.
- Complex Recovery Paths: The decompression process for an LLC hit on an XORed line is not a simple, single-cycle operation. As described in Section 4.3 and Figure 7, cases like "remote recovery" require a multi-hop sequence: request to LLC -> LLC forwards request + XORed data to a partner sharer -> partner sharer performs XOR and forwards data to the original requestor. This is an extremely long latency path for what should be an LLC hit.
- Expensive UnXORing Operations: The unXORing mechanism, required for writes (getM), last-sharer evictions (putS), and data evictions (Section 4.4), is fundamentally expensive. It necessitates issuing a "special write-back request" to a private cache to retrieve the partner line, waiting for the data, performing the XOR, and only then proceeding. Stalling an incoming write request while this multi-message round trip completes will introduce significant latency.
- Contradictory Latency Claims: The paper quantifies the message overhead at 18.2% and network traffic increase at 23.4% (Section 6.4.2), yet dismisses the performance impact. This is a major contradiction. Such a significant increase in traffic and dependency-chains in the coherence protocol cannot plausibly result in a near-negligible 2.06% slowdown. The authors' assertion that future network bandwidth scaling will absorb this cost is speculative and constitutes hand-waving.
-
Insufficient Justification for Protocol Correctness (Deadlock Freedom): The argument for deadlock freedom in Section 4.5 is unconvincing. The authors state they use Murphi for single-address verification and then "analytically evaluate" for multiple addresses. Analytical proofs for complex, unblocking coherence protocols are notoriously difficult and prone to missing subtle corner cases involving request/response ordering and resource dependencies. Given that the protocol introduces new inter-line dependencies (a request for line B can trigger a write-back request for line A), a more rigorous formal proof or a more comprehensive model-checking setup is required to substantiate the claim of deadlock freedom without extra virtual networks.
-
Unconvincing Advantage Over a Simpler, Stronger Baseline: The Exclusive LLC is a critical baseline because it also eliminates redundancy due to inclusion, but does so with a much simpler coherence implementation. According to the authors' own data in Figure 14, XOR Cache+BDI is only 1.30x smaller than an Exclusive LLC+BDI. Is the immense complexity of the XOR Cache coherence protocol—with its multi-hop recovery paths, special write-back requests, and inter-line dependencies—a justifiable trade-off for this marginal density improvement? The paper fails to make this case convincingly. A simpler path to higher density would be to apply a stronger intra-line compression scheme to the Exclusive LLC.
-
Ambiguous Critical Path Description: There is a critical ambiguity in the read path description. Section 5.2.1 states that for a read on an XORed line, the request "also needs to access the XORed partner's coherence metadata." The text states: "First, we follow XORPtr to access the tag entry of the XOR partner. Then, a second lookup in the directory is performed." This describes a serial lookup process. This serialized dependency on the critical path of an LLC hit is a major source of latency that does not appear to be adequately modeled or justified in the performance results.
Questions to Address In Rebuttal
-
Please provide a detailed cycle-level breakdown of the three data forwarding cases (local, direct, remote recovery) for an LLC read hit on an XORed line. How do these latencies compare to a standard LLC hit, and how can the average performance overhead be only 2.06% when "remote recovery" is plausibly tens or hundreds of cycles long?
-
Similarly, what is the latency penalty for a getM request to an XORed line that triggers the unXORing process? Please provide a cycle-level breakdown of this critical operation.
-
The paper presents aggregated performance overheads. Please provide per-benchmark performance results. It is crucial to see if the 2.06% average hides cases of catastrophic slowdowns on workloads that heavily trigger the complex recovery and unXORing paths.
-
Please provide a more rigorous argument for deadlock freedom in the multi-address case. The current "analytical evaluation" is insufficient. What specific race conditions were considered, and how does the protocol prevent request-dependency cycles (e.g., Core1 waiting on Core2 for an unXOR dependency, while Core2 is waiting on Core1 for an unrelated resource)?
-
Please clarify the read access path described in Section 5.2.1. Are the lookups for the requested line's tag and the partner's directory entry performed in parallel or series? If in series, how is this significant latency accounted for in the evaluation?
-
Justify the decision to introduce the XOR Cache's complexity over simply using an Exclusive LLC, which also eliminates inclusion redundancy. A 30% area improvement over this baseline may not be worth the verification effort and potential performance risks of the proposed coherence protocol.
Review 2
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces the XOR Cache, a novel LLC architecture designed to reduce the area and power footprint of the last-level cache. The core idea is to exploit the data redundancy inherent in modern inclusive cache hierarchies. The mechanism is twofold:
- Inter-line Compression: It stores the bitwise XOR of two cache lines (A⊕B) in a single LLC slot. This is only possible when at least one of the original lines (e.g., A) is guaranteed to exist in a higher-level private cache, which serves as the "key" for decompression. This effectively turns the "wasted" capacity of inclusive hierarchies into a compression opportunity.
- Intra-line Compression Catalyst: By carefully selecting similar lines (A ≈ B) to XOR, the resulting line (A⊕B) has significantly lower entropy (i.e., is full of zeros). This low-entropy data is far more compressible by existing intra-line compression algorithms (like BDI, BPC), thus acting as a "catalyst" that boosts their effectiveness.
The authors present a detailed coherence protocol to manage these compressed pairs, analyze the design of the XOR pairing policy, and provide a comprehensive evaluation showing significant improvements in LLC area (1.93x) and power (1.92x) for a modest performance overhead (~2.06%), leading to a 26.3% reduction in the energy-delay product.
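The following toy sketch isolates the two claimed effects: the inclusion-guaranteed private copy acting as the decompression key, and the drop in byte-level entropy when similar lines are XORed. Lines are modeled as byte strings and the private cache as a dictionary; the sketch says nothing about the coherence protocol, which is where the real complexity lies.

```python
# Toy illustration of the XOR Cache's two claimed effects (not the protocol):
#   1) inter-line compression: the LLC stores A ^ B; the copy of A that inclusion
#      guarantees in some private cache is the key for recovering B;
#   2) catalyst effect: if A and B are similar, A ^ B is mostly zero bytes and
#      therefore far friendlier to intra-line compressors such as BDI/BPC.

LINE = 64  # bytes per cache line


def xor_lines(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def recover_partner(stored_xor: bytes, key_line: bytes) -> bytes:
    # (A ^ B) ^ A == B : the private-cache copy of A decompresses B.
    return xor_lines(stored_xor, key_line)


def zero_bytes(line: bytes) -> int:
    # crude stand-in for the "compressibility" of the XORed residue
    return sum(1 for x in line if x == 0)


# Two similar lines, e.g. arrays of small integers differing in a few entries.
A = bytes([i % 8 for i in range(LINE)])
B = bytearray(A); B[5] ^= 0x1; B[40] ^= 0x3; B = bytes(B)

private_L1 = {0x1000: A}                    # inclusion: A is cached privately
llc_slot = xor_lines(A, B)                  # one LLC slot holds both lines

assert recover_partner(llc_slot, private_L1[0x1000]) == B
assert zero_bytes(llc_slot) == LINE - 2     # low-entropy residue -> catalyst
```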
Strengths
The primary strength of this work lies in its elegant and insightful re-framing of a long-standing architectural feature.
-
Novel Conceptual Contribution: The idea of leveraging inclusion-based redundancy as a primitive for compression is the paper's most significant contribution. For years, the data duplication in inclusive caches has been viewed primarily as a drawback (wasted space) or a simple enabler for coherence snooping. This work cleverly transforms it into a core asset for compression, providing one half of the XOR pair "for free." This is a fundamental shift in perspective.
-
Powerful Synergy: The "catalyst" concept is compelling. Instead of proposing yet another standalone compression algorithm to compete with the state-of-the-art, the XOR Cache is designed to be a synergistic substrate that improves them. This makes the work broadly applicable and positions it as a foundational enhancement rather than a mutually exclusive alternative. The potential to nearly double the effectiveness of existing schemes, as suggested by the idealBank results in Figure 2 (Page 3), is profound.
-
Architectural Completeness: The authors go beyond the high-level idea and tackle the difficult implementation details, particularly the necessary cache coherence protocol modifications (Section 4, Page 5). The detailed discussion of compression, decompression, and "unXORing" flows, along with the analysis of deadlock freedom, demonstrates a mature and thorough architectural consideration that is often missing in more conceptual papers.
-
Strong Connection to Physical Design: The choice of XOR as the operator is not arbitrary. It connects directly to a body of work on in-SRAM computing and logic, where bitwise operations can be performed with extreme efficiency [2, 51]. This grounds the proposal in physical reality and strengthens the claims of low overhead for the compression/decompression logic itself.
Weaknesses
The weaknesses are less about fundamental flaws in the idea and more about its dependencies and the complexities it introduces.
-
Dependence on Cache Hierarchy Policy: The entire mechanism is predicated on an inclusive or at least partially-inclusive (NINE) hierarchy where the "key" line can be located in an upper cache. While inclusive hierarchies are common, there is also a significant design space for strictly exclusive or NINE hierarchies [6, 55]. The paper doesn't fully explore how the XOR Cache's benefits would degrade or how the protocol would need to adapt in a non-inclusive environment where the presence of the key is not guaranteed. This is a crucial piece of context in the broader landscape of cache design.
-
Complexity of Coherence and Recovery: The proposed coherence protocol is necessarily more complex than a baseline MSI protocol. The "remote recovery" path described in Section 4.3 and Figure 7 (Page 6) appears to introduce significant latency: a request for line B could trigger a multi-hop sequence involving forwarding the request and the XORed data to a sharer of line A, which then performs the computation and forwards the result. While the average performance impact is shown to be low, the worst-case latency and its potential to create critical path stalls in certain applications is a concern. The 18.2% increase in coherence messages is also a non-trivial system-level cost.
-
The Efficacy of the Pairing Policy: The synergistic "catalyst" effect is entirely dependent on the ability of the map table (Section 5.1.3, Page 7) to find similar lines to pair. This approach places the XOR Cache in the same family as other hash-based schemes like Thesaurus [24], making it susceptible to hash collisions and the limitations of the chosen map function. While the results with 7-bit SBL are good, the gap between this practical implementation and the idealBank potential shown in Figure 2 remains large, suggesting the pairing heuristic is a critical and sensitive component of the design.
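To make the sensitivity concern concrete, here is a generic similarity-signature pairing sketch in the spirit of Thesaurus-style hashing. The 7-bit signature construction (sampling high nibbles of a few bytes) is an invented stand-in, not the paper's SBL map function; the point is only that any such hash admits both missed pairings and accidental collisions.

```python
# Generic similarity-signature pairing (illustrative; not the paper's SBL map).
# Lines whose signatures collide are candidate XOR partners; dissimilar lines
# that happen to collide waste a pairing, and similar lines that differ only in
# unsampled bytes still pair -- or never pair -- which is one source of the gap
# versus the idealBank configuration.

from typing import Dict, Optional

SIG_BITS = 7


def signature(line: bytes) -> int:
    # Sample the high nibble of a handful of bytes and fold into 7 bits
    # (the sampled positions are purely an illustrative choice).
    sig = 0
    for pos in (0, 8, 16, 24, 32, 40, 48):
        sig = ((sig << 1) ^ (line[pos] >> 4)) & ((1 << SIG_BITS) - 1)
    return sig


class PairingTable:
    """Map signature -> address of an unpaired resident line."""

    def __init__(self) -> None:
        self.waiting: Dict[int, int] = {}

    def try_pair(self, addr: int, line: bytes) -> Optional[int]:
        sig = signature(line)
        partner = self.waiting.pop(sig, None)
        if partner is None:
            self.waiting[sig] = addr    # wait for a future candidate
        return partner                  # address of the line to XOR with, if any


table = PairingTable()
lineA = bytes([i % 8 for i in range(64)])
lineB = bytes([(i % 8) ^ (1 if i == 5 else 0) for i in range(64)])  # differs at an unsampled byte
assert table.try_pair(0x100, lineA) is None     # first candidate waits
assert table.try_pair(0x140, lineB) == 0x100    # signatures match, lines pair
```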
Questions to Address In Rebuttal
-
Beyond Inclusion: Could the authors elaborate on the viability of the XOR Cache concept in a non-inclusive (NINE) or strictly exclusive cache hierarchy? How would the "minimum sharer invariant" be maintained if inclusion is not guaranteed? Would it require a separate mechanism to track private cache contents, and what would the overhead of such a system be?
-
Worst-Case Performance Impact: The average performance overhead is presented as a modest 2.06%. However, could the authors provide more insight into the tail latency or worst-case slowdown? Specifically, how frequent is the multi-hop "remote recovery" path, and are there specific workload types (e.g., with irregular sharing patterns) that are disproportionately impacted by this longer-latency decompression path?
-
Scalability of Coherence Traffic: The paper notes a 23.4% increase in network traffic due to forwarding messages (Section 6.4.2, Page 11). While this may be acceptable in a 4 or 8-core system, how do the authors see this scaling in future many-core (64+ cores) or chiplet-based architectures where interconnect bandwidth is a primary performance limiter and energy consumer? Does this overhead threaten to erode the power savings from a smaller LLC at scale?
Review 3
Review Form: The Innovator
Summary
The paper introduces the "XOR Cache," a last-level cache (LLC) architecture designed to enhance compression by exploiting data redundancy inherent in modern inclusive and non-inclusive, non-exclusive (NINE) cache hierarchies. The core technical proposal is a form of inter-line compression where two cache lines (A and B) are stored in a single LLC slot as their bitwise XOR value (A⊕B). The mechanism relies on the fact that for an inclusive LLC, if line A is also present in a higher-level private cache (e.g., L1), it can be used as a key to reconstruct line B from the stored A⊕B on an LLC access. The authors claim this approach has a dual benefit: 1) it achieves a near 2:1 inter-line compression ratio, and 2) it acts as a "catalyst" for traditional intra-line compression schemes, as XORing two similar lines (A ≈ B) results in a low-entropy line (A⊕B) that is more compressible. The authors propose a corresponding coherence protocol and architectural extensions to support this mechanism.
Strengths
The primary strength of this work lies in its novel conceptual framing. The key innovation is not merely the use of XOR for compression—a well-established technique—but its specific application to exploit a feature of modern cache hierarchies that is almost universally regarded as a liability: data redundancy due to inclusion.
-
Novel Repurposing of a System Drawback: Prior work has predominantly focused on mitigating the negative effects of inclusion (e.g., wasted capacity, back-invalidations) by moving towards NINE or exclusive hierarchies [15, 26, 55], or by designing clever replacement policies [48]. This paper presents a paradigm shift by treating the duplicated data in private caches not as waste, but as a necessary key for decompression. This reframing of a problem into a solution is a significant and elegant conceptual contribution.
-
Synergistic Compression Model ("Catalyst"): The paper's second novel claim—that this inter-line XORing can catalyze intra-line compression—is well-argued. While pre-processing data to improve its compressibility is a known concept (e.g., the transformations in BPC [30]), the specific mechanism of using value-similarity-aware inter-line compression to directly boost an orthogonal intra-line compression scheme appears to be a new and powerful combination.
-
Distinction from Prior Art: The authors have correctly identified the closest prior art and articulated the novelty of their approach. For instance, while in-SRAM XOR compression has been proposed (Wang et al. [51]), the authors rightly note that such work does not target or leverage redundancy from inclusion across the hierarchy. Similarly, this work is distinct from deduplication schemes [49] which target identical, not merely similar, lines, and from approximate caches [46, 47] as it provides lossless reconstruction. The novelty is in the system-level integration and the cross-level dependency, which is a unique mechanism.
Weaknesses
While the core idea is novel, its contribution and applicability are tightly bound to specific architectural assumptions. The evaluation of novelty must consider the context in which it operates.
-
Narrowing Applicability of the Core Premise: The entire innovative mechanism is predicated on the existence of data redundancy between the LLC and private caches. While currently common, there is a clear industry and academic trend towards NINE and strictly exclusive LLCs to maximize effective capacity. The novel contribution of this work, therefore, exists within a potentially shrinking design space. In a strictly exclusive hierarchy, the core mechanism is fundamentally broken, and the entire contribution vanishes. The novelty is therefore conditional, not universal.
-
Complexity vs. Incremental Novelty: The core operation, XORing two blocks, is not new. The novelty is exclusively in its system-level application, which requires a substantial increase in complexity: a redesigned coherence protocol with new forwarding paths (Section 4), deadlock considerations (Section 4.5), and new hardware structures like the map table (Section 5.1). While the performance results seem to justify this complexity, one must question if a less complex, and therefore less novel, approach could achieve similar gains. For example, would simply investing the same transistor budget into a more powerful standalone intra-line compression scheme on a conventional NINE cache yield comparable results without the coherence overhead? The paper does not sufficiently argue that this complex, cross-level approach is fundamentally superior to enhancing existing single-level techniques.
Questions to Address In Rebuttal
-
On the Delta vs. Self-Contained XOR Compression: The authors differentiate their work from Wang et al. [51] by highlighting the focus on inclusion redundancy. This is the critical delta. Could the authors elaborate on the fundamental trade-offs of their cross-level-dependent recovery mechanism versus a hypothetical intra-LLC scheme where both lines (or a base and XORed delta) are recoverable from within the LLC itself? What are the specific performance and complexity advantages that make the reliance on private caches a superior novel approach?
-
On the Longevity of the Novel Contribution: The novelty of this work is contingent on inclusive/NINE hierarchies. Could the authors comment on the applicability of their core idea to future systems that may aggressively pursue strict exclusivity to maximize capacity? Is there any facet of the "XOR Cache" concept that could be adapted to provide a novel benefit in such a context, or is the contribution strictly limited to non-exclusive hierarchies?
-
On the "Catalyst" Effect as a Standalone Concept: The synergy between inter- and intra-line compression is a key claim. How much of the benefit of the "catalyst" effect is due to the specific XOR mechanism versus the general principle of pairing similar lines? Could a simpler pre-processing hash or signature, used to identify similar lines for a state-of-the-art intra-line compressor, achieve a significant fraction of the "catalyst" benefit without the need for the full XOR Cache architecture and coherence protocol? This would help isolate the novelty of the complete system from the novelty of its constituent ideas.
Precise exceptions in relaxed architectures
Abstract
To manage exceptions, software relies on a key architectural guarantee,precision: that exceptions appear to execute between instructions. However, this definition, dating back over 60 years, fundamentally assumes a sequential programmers model. Modern ...
Reviews
Review 1
Paper Title: Precise exceptions in relaxed architectures Reviewer: The Guardian
Summary
This paper investigates the interaction between exceptions and relaxed memory models, focusing on the Arm-A architecture as a representative case. The authors identify the inadequacy of the classic, sequentially-consistent definition of "precise exceptions" in the context of modern multicore processors. They explore various relaxed behaviors across exception boundaries using a suite of litmus tests, conduct preliminary hardware testing, and propose an extension to an existing axiomatic model to account for some of these behaviors. The paper concludes by sketching a model for software-generated interrupts (SGIs) and their role in synchronization.
While the paper addresses an undeniably important and under-explored area, its contributions are undermined by a limited scope, insufficient empirical validation, and a reliance on informal specifications. The proposed models, particularly for the Generic Interrupt Controller (GIC), are presented as "drafts" and "sketches," raising questions about the completeness and maturity of the work for a top-tier venue like ISCA.
Strengths
- Problem Identification: The paper correctly identifies a critical and long-overlooked gap in architectural specifications: the definition of exception precision in the presence of relaxed memory behaviors. This is a fundamentally important problem for systems software developers and architects.
- Executable Model Integration: The use of the Isla tool to create an executable-as-a-test-oracle for the axiomatic model (§5.1, pg 8) is a methodologically sound approach. Providing an executable artifact for exploring behaviors is a valuable contribution.
- Litmus Test Curation: The development of a library of litmus tests (§3.2, pg 4) to probe specific out-of-order behaviors across exception boundaries provides a concrete basis for discussion and future testing efforts.
Weaknesses
-
Severely Limited and Incomplete Models: The paper’s claims of providing semantic models are significantly overstated.
- The model for software-generated interrupts is explicitly called a "draft axiomatic extension" (§7.5, pg 12). The authors acknowledge that a large quantity of the GIC architecture, a component with a 950-page specification, would need to be formalized (§1.2, pg 2; §7.5, pg 12). Presenting a "sketch" of such a complex component is insufficient; this is exploratory work, not a complete model.
- The paper explicitly scopes out imprecise exceptions and the behavior of 'constrained unpredictable' states (§1.2, pg 2). This is a major omission. A robust treatment of exceptions cannot simply ignore the most difficult cases, as they are often where the architectural guarantees are weakest and most critical. The title promises a treatment of "precise exceptions," but a full understanding requires contrasting them with the alternative.
-
Insufficient Empirical Validation: The hardware testing (§3.6, pg 6) is not comprehensive enough to substantiate the paper's claims about architectural behavior.
- The test suite comprises only 61 hand-written tests. For the vast state space of relaxed memory and exception interactions, this is a very small sample. The authors themselves note that the suite "is relatively small" and "ideally could be auto-generated" (§1.2, pg 2), which concedes a major methodological weakness.
- The results table (Fig 9, pg 6) is replete with U outcomes (allowed but unobserved). An absence of observed behavior on a limited set of microarchitectures is not strong evidence that the behavior is disallowed across the entire architecture or that the model is correct. This is particularly weak for forbidding behaviors.
- The selection of hardware, while varied, is not exhaustive. Key server-class microarchitectures that might exhibit different relaxed behaviors are notably absent.
-
Reliance on Informal "Architectural Intent": The foundation of the axiomatic model's correctness appears to rest on "detailed discussions with Arm senior staff" (§1.1, pg 1) and the authors' interpretation of "architectural intent" (§5.1, pg 8). A formal model cannot be grounded in informal conversations. This is not a verifiable or repeatable scientific process. Without a formal, machine-readable specification from the vendor to check against, the model is merely a hypothesis, not a validated artifact. The rephrasing of the Arm prose specification in Figure 2 (pg 4) is the authors' own, not an official definition.
-
Failure to Solve the Core Problem: The paper frames the core challenge as defining precision in a relaxed setting (§6, pg 8). However, it does not actually provide a new, formal definition. Instead, it "identif[ies] and discuss[es] the substantial open problem" (§1.1, pg 2). The paper enumerates phenomena and challenges but stops short of delivering a conclusive, generalizable definition of precision that could be applied across architectures. It highlights the problem effectively but does not solve it.
Questions to Address In Rebuttal
-
The model for SGIs and the GIC is described as a "draft" (§7.5, pg 12) and relies on non-existent or non-public ASL. How can the community build upon or verify a contribution that is admittedly incomplete and based on un-specified machinery? Is it appropriate to present such preliminary work as a primary contribution at this venue?
-
Your validation strategy relies on consistency between your model, a small set of hardware tests, and "architectural intent" derived from informal discussions (§5.1, pg 8). How do you formally guarantee that your model is not overly permissive (i.e., allows behaviors forbidden by the true architecture) or overly restrictive (i.e., forbids behaviors that some implementations might allow)? What is your strategy for validation beyond this informal and limited-scope approach?
-
Given the high number of 'allowed but unobserved' (U) outcomes in your hardware testing (Fig 9, pg 6), how can you confidently forbid certain behaviors in your model (e.g., those in Fig 5, pg 5)? Absence of evidence is not evidence of absence. Is it not possible that these forbidden behaviors could manifest on other microarchitectures or under different testing conditions?
-
The paper's title concerns "Precise exceptions," yet the work explicitly scopes out imprecise exceptions, which form the necessary contrast for a complete definition. How does this significant omission affect the utility and generalizability of your findings and model? Can one truly clarify precision without also formally bounding imprecision?
Review 2
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper tackles a long-standing and fundamental problem at the intersection of computer architecture, formal methods, and systems programming: what does it mean for an exception to be "precise" in a world of relaxed memory concurrency? The authors correctly identify that the classical definition, rooted in a sequential execution model, is woefully inadequate for modern out-of-order, multi-core processors like Arm-A. The core contribution is the development of a formal axiomatic model that clarifies the concurrency semantics across exception boundaries. This model is not merely theoretical; it is grounded in hardware testing, deep architectural discussions, and implemented in the Isla tool for validation. Crucially, the work extends this analysis to the complex, real-world use case of software-generated interrupts (SGIs), demonstrating the direct impact of these semantics on critical systems software like the Linux kernel's RCU mechanism.
Strengths
-
Addresses a Foundational Gap: This work is highly significant because it addresses a conceptual blind spot that has existed for decades. The community has largely modeled user-space relaxed memory behavior, but the systems-level interactions involving exceptions and interrupts have remained a "grey area" governed by folklore and cautious over-approximation. By providing a formal basis for reasoning about these interactions, this paper provides an essential service to architects, compiler writers, and OS developers.
-
Excellent Methodological Synthesis: The paper is a model of how to conduct research in this domain. It skillfully combines several research traditions:
- Empirical investigation through hardware litmus testing (§3.6, page 6).
- Deep architectural analysis based on vendor documentation and direct discussion with architects.
- Formal modeling using the established axiomatic 'cat' framework (§5, page 7).
- Tool-based validation using an executable-as-test-oracle (Isla) (§5.1, page 8). This multi-pronged approach builds high confidence in the results and connects the abstract model to concrete reality.
-
Strong Connection to Real-World Systems: The paper's final act—the analysis of SGIs for synchronization patterns like RCU (§7, page 9-11)—is its masterstroke. This section elevates the work from a purely architectural or formal methods exercise into something of immediate relevance to the systems community. It explains the why behind the need for this formal clarity and demonstrates that subtle architectural ordering rules have a direct and profound impact on the correctness and performance of widely-used software.
-
Intellectual Honesty and Vision: The authors do an excellent job of delineating the scope of their contribution while also clearly articulating the remaining open challenges. The discussion in §6 ("Challenges in defining precision", page 8) is particularly insightful, framing their work as a crucial step towards solving a much larger problem, rather than claiming to have solved it entirely. This positions the paper as a foundational piece upon which a great deal of future work can be built.
Weaknesses
While this is an excellent paper, its primary weakness is a natural consequence of its depth and ambition: complexity and accessibility.
-
High Barrier to Entry: The subject matter is inherently complex, and a reader not already steeped in the literature of relaxed memory models and axiomatic semantics will find it challenging. While the authors do a commendable job, a short, high-level "Programmer's Takeaway" section summarizing the key invariants that a systems programmer can now rely on (or must be wary of) would significantly broaden the paper's impact.
-
Arm-A Focus: The deep dive into Arm-A is a major strength, but as a synthesizer, I am left wondering about the generality of the phenomena described. While the authors expect challenges to appear in other architectures, a brief discussion of how the core concepts (e.g., context synchronization, speculative entry) might manifest in RISC-V (with its evolving specifications) or POWER would help contextualize the Arm-A-specific details and frame a more general research agenda.
-
Preliminary SGI Model: The authors are upfront that their axiomatic extension for SGIs is a "draft" (§7.5, page 12). While the analysis leading up to it is sound, this leaves a key part of the contribution feeling less complete than the rest of the work. The full formalization of the GIC is rightly identified as a major undertaking, but the current model's limitations could be more sharply defined.
Questions to Address In Rebuttal
-
Generalizability: How would your conceptual framework for relaxed exceptions apply to other major architectures? For instance, the RISC-V specification includes extensions for supervisor-level behavior ("S"-extension) but is less prescriptive than Arm's. Do you foresee similar or fundamentally different challenges in modeling precision there?
-
Actionable Guidance for Programmers: Based on your findings, what is the single most surprising or important behavior that a kernel developer writing an exception handler on Arm-A should be aware of? For example, is it the fact that memory accesses can be reordered across the SVC/ERET boundary, or something more subtle related to system register dependencies?
-
The Path to a Full Definition of Precision: Your discussion in §6 is excellent. Does your model offer a partial, constructive definition of precision for a subset of cases (e.g., for resumable faults), or does its primary utility lie in formally demonstrating the inadequacy of the old definition? What do you see as the most critical next step: modeling the UNKNOWN states, or something else?
Review 3
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents a new, formal framework for defining and reasoning about precise exceptions in the context of relaxed-memory computer architectures. The core novel claim is the creation of the first-ever formal, axiomatic model that precisely specifies the observable behaviors at exception boundaries on a modern, weakly-ordered processor (using ARMv8-A as the primary example). This is achieved by developing a new set of axioms within the isla specification tool that constrain the allowed interactions between the instruction stream, the memory system, and the exception handling mechanism. The work claims this provides a new, rigorous foundation for understanding a part of the architectural specification that has, until now, been defined only by informal prose.
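For readers less familiar with cat-style axiomatic models, the toy checker below shows the general shape of such a specification: a candidate execution is a set of events plus named relations, and it is allowed iff a derived ordering relation is acyclic. The events, relation names, and the single acyclicity axiom are invented for illustration; they are not the paper's axioms, Arm's model, or anything Isla ships.

```python
# Toy illustration of the axiomatic style (invented axiom; not Arm's model):
# a candidate execution = events + named relations; it is allowed iff the
# union of program order, reads-from, and a hypothetical exception-ordering
# relation is acyclic.

from itertools import product
from typing import Set, Tuple

Edge = Tuple[str, str]


def transitive_closure(rel: Set[Edge]) -> Set[Edge]:
    closure = set(rel)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(closure, closure):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def allowed(po: Set[Edge], rf: Set[Edge], exc: Set[Edge]) -> bool:
    ob = transitive_closure(po | rf | exc)          # "ordered-before" (toy)
    return all(a != b for a, b in ob)               # axiom: ob is acyclic


# Candidate execution: a faulting instruction F, two handler writes W1 -> W2
# in program order, a read R observing W2 (rf), and a hypothetical edge
# ordering the faulting instruction after R.
po = {("F", "W1"), ("W1", "W2")}
rf = {("W2", "R")}
exc = {("R", "F")}
print(allowed(po, rf, exc))   # False: cycle F -> W1 -> W2 -> R -> F
```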
Strengths
From a novelty standpoint, this paper is a significant breakthrough. It does not just incrementally improve an existing idea; it creates a new one by applying formal methods to a domain where they had not been successfully applied before.
- A Fundamentally New Type of Architectural Model: The most significant "delta" of this work is that it creates a new category of architectural artifact. While formal models for memory consistency and instruction semantics have existed for years, this is the first formal, axiomatic model of the exception mechanism for a modern, relaxed-memory architecture. It takes a concept that was previously "in the ether"—defined by prose, convention, and folklore—and crystallizes it into a mathematically precise and machine-checkable definition. This is a foundational and highly novel contribution. 🧠
- Novel Insights from a New Perspective: By applying the lens of formal methods to this problem, the authors have uncovered new and non-obvious behaviors and ambiguities in the existing ARMv8-A specification (Section 5, Page 10). The very act of formalizing the prose revealed corner cases and interactions that were not apparent from a traditional, informal reading. The discovery and formalization of these subtleties is a novel and important scientific result.
- A New Methodology for Architectural Specification: This work doesn't just produce a model; it implicitly proposes a new, more rigorous way to specify computer architectures. It makes a powerful case that informal prose is no longer sufficient for defining the complex interactions in modern processors. The methodology of using a tool like isla to create an executable, testable specification for the exception model is a novel approach that could, and should, be adopted by hardware vendors in the future.
Weaknesses
The novelty of the work is so profound that its primary weakness is that it stands alone. It is a new foundation, but the structure that will be built upon it is not yet clear.
- The "So What?" Question: The paper does an excellent job of proving that a formal model can be built and what that model looks like. However, it spends less time demonstrating the novel capabilities that this new model enables. The evaluation is focused on demonstrating the model itself, not on using the model to do something that was previously impossible (e.g., formally verifying a piece of a hypervisor's exception handling code). The novelty is in the creation of the tool, not yet in the results of its application.
- A Bespoke Creation: The formal model is a massive, bespoke intellectual effort. It is not yet clear how the process of creating this model could be generalized or automated. Is this a one-time heroic effort, or does it represent a new, repeatable methodology that could be applied to other architectures (like RISC-V or x86) with a reasonable amount of effort? The novelty of the methodology, as a repeatable process, is not yet proven.
- Conceptual, Not Algorithmic, Novelty: It is important to note that the novelty here is in the application and the conceptual framework, not in the invention of new core formal methods algorithms. The work leverages an existing tool (isla) and standard techniques from the formal methods community. The innovation is in the bridge between two fields, not in a breakthrough within the field of formal methods itself.
Questions to Address In Rebuttal
- Your work is the first to create a formal model of this kind. What is the most significant, previously unknown or ambiguous architectural behavior that your novel model helped to clarify or discover?
- The creation of this model was clearly a massive undertaking. What is the key novel insight you have gained that would make the process of formally modeling the exception architecture of a different processor (e.g., RISC-V) an order of magnitude easier?
- Now that you have this novel formal model, what is the single most important, previously unsolvable verification problem that you believe can now be solved with it?
- The novelty of this work is in applying formal methods to a new architectural domain. What do you see as the next "dark corner" of the architectural specification that is currently defined only by prose and is in desperate need of a similar formal modeling effort?
Rethinking Prefetching for Intermittent Computing
Abstract
Prefetching improves performance by reducing cache misses. However, conventional prefetchers are too aggressive to serve batteryless energy harvesting systems (EHSs) where energy efficiency is the utmost design priority due to weak input energy and the ...
Reviews
Review 1
Persona 1: The Guardian (Adversarial Skeptic)
Review Form
Summary
This paper introduces IPEX, an extension for hardware prefetchers designed for energy harvesting systems (EHSs) that experience frequent power failures. The core idea is to make prefetching "intermittence-aware." By predicting an impending power outage, IPEX throttles the aggressiveness of the prefetcher to avoid fetching data that will never be used before the system loses power, thereby saving energy. The authors claim this approach reduces energy consumption and improves forward progress compared to conventional prefetching in an intermittent execution environment.
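For concreteness, a minimal sketch of the kind of intermittence-aware throttling policy described above follows; the voltage thresholds, the degree mapping, and the predictor interface are illustrative assumptions, not the actual IPEX design.

```python
# Illustrative sketch only: an intermittence-aware prefetch throttle driven by
# a voltage-threshold power-failure predictor. All names and thresholds are
# hypothetical; they are not taken from the IPEX paper.

class VoltagePredictor:
    """Predicts an imminent power failure when the capacitor voltage drops low."""
    def __init__(self, warn_threshold=2.2, critical_threshold=1.9):
        self.warn = warn_threshold
        self.critical = critical_threshold

    def risk(self, voltage):
        if voltage <= self.critical:
            return "imminent"
        if voltage <= self.warn:
            return "likely"
        return "unlikely"


def choose_prefetch_degree(voltage, predictor, max_degree=4):
    """Map predicted outage risk to prefetcher aggressiveness."""
    risk = predictor.risk(voltage)
    if risk == "imminent":
        return 0                 # stop prefetching: fetched lines would be lost
    if risk == "likely":
        return max_degree // 2   # throttle, but keep some lookahead
    return max_degree            # full aggressiveness while energy is plentiful


if __name__ == "__main__":
    predictor = VoltagePredictor()
    for v in (3.3, 2.1, 1.8):
        print(f"Vcap={v:.1f} V -> prefetch degree {choose_prefetch_degree(v, predictor)}")
```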
Strengths
The paper correctly identifies a clear and logical problem.
- Valid Problem Identification: The central premise is sound: aggressive, conventional prefetching designed for always-on systems is likely to be energy-inefficient in a system that loses power frequently, as many prefetched lines will be useless (Section 1, Page 1). This is a valid and important problem to address for this class of device.
Weaknesses
The paper's conclusions are built upon a foundation of unrealistic assumptions, a flawed evaluation methodology, and an incomplete analysis of critical system overheads.
- Fundamentally Unrealistic Power Failure Prediction: The entire mechanism of IPEX hinges on the ability to "properly" predict an upcoming power failure (Section 1, Page 1). The paper relies on a simple voltage-threshold-based predictor and hand-waves away the immense difficulty of this problem. In real EHSs, the power input is noisy and unpredictable. A simple voltage drop is not a reliable indicator of an imminent, unrecoverable outage; it could be a transient dip. This means the predictor will have non-zero false positives (throttling prefetching when it shouldn't, hurting performance) and false negatives (failing to throttle when it should, wasting energy). The evaluation completely ignores the impact of prediction inaccuracy, assuming a perfect oracle. This is a fatal flaw that invalidates the real-world applicability of the results.
- Overhead of IPEX Logic is Ignored: The paper claims energy savings but fails to account for the energy consumed by the IPEX hardware itself. The logic to track prefetch metadata, monitor the voltage predictor, and dynamically adjust the prefetch degree (Section 4, Page 4) is not "free." It consumes static and dynamic power. Without a detailed, post-layout power analysis of the IPEX module itself, the claimed net energy savings are unsubstantiated. The savings could be entirely consumed by the overhead of the controller.
- Incomplete and Unfair Baseline Comparison: The evaluation compares IPEX integrated with a stride prefetcher against the same stride prefetcher without IPEX (Section 5, Page 6). This is an insufficient baseline. A more rigorous evaluation would compare IPEX against other, more conventional energy-saving techniques. For example, a simple approach would be to statically configure the stride prefetcher to be less aggressive (e.g., reduce its degree or distance). It is not proven that the complexity of IPEX's dynamic throttling is superior to a simpler, statically-tuned conservative prefetching policy.
- Contradictory Performance Claims: The paper claims that by saving energy, IPEX allows the system to "make further execution progress" (Abstract, Page 1). This is a logical leap. The act of throttling the prefetcher inherently increases the cache miss rate when the power doesn't fail, which reduces performance. The paper is conflating two scenarios: it improves energy efficiency when a power failure is imminent, but it necessarily degrades performance when a failure is incorrectly predicted. The reported average performance improvement (8.96%) is an artifact of an evaluation that likely does not properly penalize the system for false positives from the power failure predictor.
Questions to Address In Rebuttal
- Please provide a sensitivity analysis showing how the energy and performance benefits of IPEX degrade as the accuracy of the power failure predictor decreases. What is the break-even point in terms of predictor accuracy where IPEX provides no net benefit?
- Provide a detailed, post-layout power and area analysis of the IPEX hardware module itself. You must explicitly report this overhead to prove that the mechanism provides a net system-wide energy saving.
- How does IPEX compare to a baseline system that uses a simple, statically configured, less-aggressive prefetcher (e.g., a stride prefetcher with a maximum degree of 1)? You must show that the dynamic complexity of IPEX is superior to a simple, conservative policy.
- Can you provide a detailed, cycle-by-cycle analysis of a scenario where the power predictor generates a false positive? You must quantify the performance loss (i.e., the increase in stall cycles due to additional cache misses) incurred by unnecessarily throttling the prefetcher.
Review 2
Persona 2: The Synthesizer (Contextual Analyst)
Review Form
Summary
This paper, "Rethinking Prefetching for Intermittent Computing," introduces IPEX, an intelligent hardware extension designed to make conventional prefetchers suitable for the unique environment of battery-less, energy harvesting systems (EHSs). The core contribution is to make the prefetcher "intermittence-aware." By monitoring the system's energy state and predicting impending power outages, IPEX dynamically throttles the aggressiveness of the prefetcher. This prevents the system from wasting precious harvested energy fetching data that will be lost on power failure anyway, allowing that energy to be used for useful computation instead. This work effectively bridges the gap between classic high-performance computing techniques and the emerging domain of energy-constrained, intermittent computing.
Strengths
This paper presents a simple, elegant, and powerful idea that addresses a critical and forward-looking problem. Its strength lies in its clear understanding of the fundamental constraints of its target domain and its pragmatic approach to solving them.
- A Necessary Rethinking of a Classic Technique: The most significant contribution of this work is its recognition that a classic, well-understood performance optimization—prefetching—needs to be fundamentally re-thought for a new class of computing systems (Section 1, Page 1). The paper correctly identifies that the design assumptions of the "always-on" world do not apply to intermittent systems, and it provides a clear and logical framework for adapting to this new reality. This is a crucial step in making high-performance computing techniques viable in ultra-low-power environments. 💡
- Elegant Synthesis of Multiple System Signals: IPEX is a beautiful example of a holistic system design. It doesn't just look at the memory access stream; it synthesizes information from multiple parts of the system—the memory system (prefetch requests), the power management unit (voltage levels), and the processor core (execution progress)—to make an intelligent, system-level decision. This cross-layer approach, where the memory system is aware of the power system, is a hallmark of a well-designed, domain-specific architecture.
- Enabling a More Capable Class of Devices: The practical impact of this work could be significant. By improving the energy efficiency of EHSs, IPEX allows them to do more useful work with the same amount of harvested energy. This could enable a new generation of more capable, intelligent, and autonomous sensor devices for applications in the Internet of Things (IoT), medical implants, and environmental monitoring, where battery replacement is not an option. It helps push intermittent computing from a research curiosity toward a practical reality.
Weaknesses
While the core idea is strong, the paper could be strengthened by broadening its scope to the larger software and system ecosystem in which IPEX would operate.
- The Checkpointing Interaction: The paper focuses exclusively on the interaction between prefetching and power failures. However, all intermittent systems rely on a checkpointing mechanism to save and restore state. There is a deep and unexplored interaction here: a more energy-efficient system might be able to take more frequent checkpoints, which in turn could influence prefetching decisions. A discussion of how IPEX could be co-designed with the system's checkpointing policy would be a fascinating addition.
- The Role of the Compiler: The current IPEX design is purely hardware-driven. However, a compiler could provide valuable hints to the prefetcher. For example, a compiler could identify critical data structures that are essential for forward progress and instruct the hardware to prioritize prefetching them, even when power is low. A discussion of the potential for a hardware/software co-designed, intermittence-aware prefetching system would be very interesting.
- Beyond Simple Prefetchers: The paper evaluates IPEX with a simple stride prefetcher. However, the world of prefetching is incredibly rich, with more advanced techniques like stream, delta, and instruction-based prefetchers. An exploration of how the IPEX control philosophy could be adapted to these more complex and powerful prefetching engines would broaden the applicability of the work.
Questions to Address In Rebuttal
- Your work makes the prefetcher aware of the power system. How do you see this idea being co-designed with the system's checkpointing mechanism? Could the prefetcher, for example, be instructed to prioritize fetching data that is about to be checkpointed?
- What role could a compiler play in an IPEX-like system? Could compiler-inserted "hints" help the hardware make more intelligent decisions about which data is most critical to prefetch before a power failure? 🤔
- How would you adapt the IPEX control policy to a more complex, stateful prefetcher, such as a stream prefetcher? Would the decision to throttle be a simple on/off switch, or would it involve a more nuanced adjustment of the prefetcher's internal parameters?
- Looking at the bigger picture, this work adapts a performance technique for energy efficiency. What other "classic" performance optimizations from the high-performance computing world (e.g., branch prediction, speculative execution, caching policies) do you think need to be fundamentally re-thought to be effective in the intermittent computing domain?
Review 3
Persona 3: The Innovator (Novelty Specialist)
Review Form
Summary
This paper proposes IPEX, an "Intermittence-aware Prefetching Extension" for energy harvesting systems. The core novel claim is a new hardware control mechanism that dynamically modulates the aggressiveness of a conventional hardware prefetcher based on the likelihood of an impending power failure. This is achieved by coupling the prefetcher's control logic with a power failure predictor. When a power outage is predicted to be imminent, IPEX throttles the prefetcher's degree to prevent it from fetching data that would be lost, thereby saving energy. The synthesis of a power predictor with a prefetch controller to create a new, "intermittence-aware" hardware module is presented as the primary novel contribution.
Strengths
From a novelty standpoint, this paper's strength lies in its creation of a new, domain-specific control policy for a well-known hardware component, effectively creating a new class of prefetcher.
- A Novel Prefetcher Control Policy: The most significant "delta" in this work is the introduction of a new axis for prefetcher control: intermittence awareness. While prior art has explored adaptive prefetchers that modulate their behavior based on accuracy or memory bandwidth, this is the first work to propose a prefetcher that adapts its behavior based on the energy state of the system and the temporal horizon of its own utility (i.e., the time until the next power failure). This is a fundamentally new and non-obvious control heuristic for prefetching hardware. 💡
- A New Synthesis of System Components: The IPEX architecture itself is a novel synthesis of previously disconnected system components. It creates a direct data path and control relationship between the power management unit (the voltage sensor) and the memory subsystem (the prefetcher). This tight coupling of power and memory management at the microarchitectural level is a novel design pattern that has not been explored in prior work on conventional or low-power systems.
Weaknesses
While the core concept is novel, it is important to contextualize its novelty. The work cleverly combines existing components but does not invent fundamentally new predictive or prefetching mechanisms.
- Core Components Are Not New: The novelty of IPEX is in the synthesis, not in the constituent parts.
- Prefetcher: The paper uses a standard stride prefetcher as its base. It does not propose a new algorithm for identifying memory access patterns.
- Power Failure Predictor: The paper uses a simple voltage-threshold-based power failure predictor. While effective for the paper's purpose, it is an adaptation of known techniques in the energy harvesting systems literature, not a new prediction algorithm.
- "Throttling" is a Standard Control Action: The mechanism of reducing the degree or aggressiveness of a prefetcher is a standard knob in adaptive prefetching designs. The novelty is not the act of throttling, but the trigger for that throttling (i.e., a predicted power failure).
- Performance Gains Are Not a Novel Insight: The paper's reported performance and energy improvements are a direct and expected consequence of a more intelligent control policy. It is not a novel discovery that avoiding useless work saves energy. The novelty is in the creation of the mechanism that can identify and prevent that useless work, not in the result itself.
Questions to Address In Rebuttal
- The core of your novelty is the "intermittence-aware" control policy. Can you contrast your approach with prior work on power-aware or energy-aware prefetching in the mobile/low-power computing domain? What is the key "delta" that makes your intermittence-aware approach fundamentally different?
- The IPEX module creates a new, direct interface between the power predictor and the prefetcher. Is this a point solution, or do you see this as a novel, generalizable microarchitectural pattern? What other components in the processor could benefit from being directly controlled by a power failure predictor?
- If a competitor were to propose a system that used a sophisticated machine-learning-based power failure predictor, but a very simple prefetch throttling mechanism, where would the novelty of your system lie? Does the novelty rest more on the concept of the coupling, or on the specific implementation of the control logic?
- What is the most non-obvious or surprising interaction you discovered between the behavior of the prefetcher and the power state of the system that led you to the specific design of the IPEX control policy?
Qtenon: Towards Low-Latency Architecture Integration for Accelerating Hybrid Quantum-Classical Computing
Abstract
Hybrid quantum-classical algorithms have shown great promise in leveraging the computational potential of quantum systems. However, the efficiency of these algorithms is severely constrained by the limitations of current quantum hardware architectures. ...
Reviews
Review 1
Persona 1: The Guardian (Adversarial Skeptic)
Review Form
Summary
This paper proposes Qtenon, a tightly coupled hardware/software architecture for hybrid quantum-classical computing. The authors identify the high latency of communication between the classical host and the quantum processor (QPU) as a major bottleneck. To address this, they propose a system with a unified memory hierarchy, an efficient multi-stage quantum controller, and a corresponding compiler stack based on a custom IR. The authors claim this tightly integrated approach provides significant speedups (up to 21.6x) for hybrid algorithms like VQE and QAOA compared to current, loosely-coupled designs.
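To make the latency argument concrete, the sketch below models an iterative hybrid loop in the spirit of VQE or QAOA, where every parameter update pays one host-to-QPU round trip; the timing values are placeholders chosen for illustration, not measurements from the paper.

```python
# Illustrative latency model for an iterative hybrid quantum-classical loop.
# The numbers below are placeholders, not values reported in the Qtenon paper.

def hybrid_runtime(iterations, circuit_time_us, classical_time_us, interface_latency_us):
    """Total time when every iteration pays one host<->QPU round trip."""
    per_iteration = circuit_time_us + classical_time_us + interface_latency_us
    return iterations * per_iteration


if __name__ == "__main__":
    iters = 1000        # optimizer iterations
    circuit = 50.0      # us per circuit batch on the QPU
    classical = 20.0    # us for the classical parameter update
    for label, latency in [("socket-style interface", 500.0),
                           ("tightly coupled interface", 5.0)]:
        total = hybrid_runtime(iters, circuit, classical, latency)
        share = latency / (circuit + classical + latency)
        print(f"{label:26s}: {total / 1e6:.3f} s total, "
              f"{share:.0%} of each iteration spent on the interface")
```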
Strengths
The paper correctly identifies a well-known and critical problem in the field of quantum computing.
- Valid Problem Identification: The central premise is sound: the communication latency between the classical and quantum components in current systems is a first-order performance bottleneck that severely limits the practicality of many promising hybrid algorithms (Section 1, Page 1).
Weaknesses
The paper's conclusions are built upon a foundation of an oversimplified problem representation, a flawed and inequitable baseline, and a failure to address the most difficult physical and engineering challenges inherent in its proposal.
- Fundamentally Unrealistic Baseline Comparison: The headline performance claims are invalid because the baseline "decoupled system" is a strawman. It models the quantum-classical interface as a simple network socket connection (Section 5.1, Page 8), which represents the most naive possible implementation. State-of-the-art quantum control systems from both industry and academia employ far more sophisticated real-time control hardware and optimized communication protocols. By comparing against a simplistic baseline, the reported speedups are grossly exaggerated and do not reflect the true benefit of the Qtenon architecture over a realistic, well-engineered alternative.
- Critical Overheads are Ignored: The paper's core proposal, the "unified memory hierarchy," is presented without a rigorous analysis of its physical implementation and associated overheads. The QPU operates at cryogenic temperatures (millikelvin), while the classical host operates at room temperature. The paper completely ignores the immense physical and engineering challenges of creating a low-latency, high-bandwidth memory interface that bridges this thermal gap. The latency and power consumption of the specialized I/O links required for this are not modeled, and could easily consume a significant portion of the claimed performance gains.
- Scalability Claims are Unsubstantiated: The evaluation is performed on small-scale quantum algorithms with a limited number of qubits (up to 30) and parameters (Table 1, Page 9). There is no evidence to suggest that the proposed architecture, particularly the centralized quantum controller and memory arbiter, can scale to the thousands or millions of qubits required for fault-tolerant quantum computing. The centralized design is a classic bottleneck, and the paper fails to provide a convincing argument for how it will avoid becoming a performance limiter at larger scales.
- Compiler Optimizations Are Not Rigorously Evaluated: The paper proposes several compiler optimizations, such as "fine-grained instruction scheduling" and "just-in-time (JIT) compilation" (Section 4, Page 7). However, the performance impact of these optimizations is not isolated in the evaluation. It is unclear how much of the speedup comes from the hardware architecture versus these software techniques. A fair evaluation would compare the full Qtenon system against a baseline that also employs the same advanced compiler optimizations.
Questions to Address In Rebuttal
- To provide a fair comparison, please evaluate Qtenon against a more realistic baseline that models a state-of-the-art, real-time quantum control system, not a simple socket-based interface.
- Please provide a detailed physical model and analysis of your proposed unified memory interface. What is the projected latency, bandwidth, and power consumption of the cryogenic-to-room-temperature link, and how does this impact your end-to-end performance claims?
- Provide a scalability analysis of your centralized quantum controller. At what number of qubits do you project that the controller will become the primary performance bottleneck, and what is your strategy for scaling beyond that limit?
- Please provide an ablation study that isolates the performance contribution of your proposed hardware architecture (unified memory, controller) from your proposed compiler optimizations. This is necessary to prove the specific value of the hardware design itself.
Review 2
Persona 2: The Synthesizer (Contextual Analyst)
Review Form
Summary
This paper introduces Qtenon, a tightly coupled, co-designed hardware and software system for accelerating hybrid quantum-classical algorithms. The work's central contribution is a new architectural paradigm that moves beyond the current, loosely-coupled model of a classical host computer controlling a remote quantum accelerator. Qtenon proposes a unified memory hierarchy that allows the classical and quantum processors to communicate directly through shared memory, a multi-stage, pipelined quantum controller to efficiently manage the QPU, and a sophisticated, LLVM-based compiler stack to orchestrate the entire system. By dramatically reducing the quantum-classical communication latency, Qtenon aims to make complex, iterative hybrid algorithms practical for the first time.
Strengths
This paper presents a forward-looking and deeply important vision for the future of quantum computing systems. Its strength lies in its application of well-understood principles from classical computer architecture to solve a critical, emerging problem in the quantum domain.
- A Necessary Architectural Evolution: The most significant contribution of this work is that it provides a concrete and well-reasoned architectural blueprint for the next generation of quantum computers. The current "remote accelerator" model is a necessary first step, but it is not a viable long-term solution. Qtenon correctly identifies that a tighter integration is essential and thoughtfully applies decades of lessons from classical heterogeneous computing (e.g., the evolution of CPU-GPU interfaces) to the quantum world. This is not an incremental improvement; it is a fundamental and necessary evolution of the quantum computing system stack. 🚀
- Elegant Synthesis of Hardware and Software: Qtenon is a beautiful example of a true hardware/software co-design. It recognizes that the latency problem cannot be solved by hardware alone. The proposed compiler stack, with its custom QIR intermediate representation and its ability to perform fine-grained scheduling and JIT compilation (Section 4, Page 7), is just as important as the unified memory and the advanced controller. This holistic, full-stack approach is a hallmark of a mature and well-considered system design.
- Enabling the Future of Quantum Algorithms: The practical impact of this work could be immense. The massive latency of current systems makes many of the most promising hybrid algorithms, particularly those that require rapid feedback between the classical and quantum components (like certain quantum machine learning or optimization algorithms), completely impractical. By reducing this latency by over an order of magnitude (Figure 9, Page 11), Qtenon could enable a whole new class of algorithms to be explored, potentially accelerating the timeline for achieving a practical quantum advantage.
Weaknesses
While the high-level vision is compelling, the paper could be strengthened by exploring the deeper engineering and ecosystem challenges that would need to be overcome for this vision to become a reality.
- The Cryogenic Memory Challenge: The paper's concept of a "unified memory" (Section 3.1, Page 3) is powerful, but it abstracts away the immense physical challenge of building a high-performance memory system that spans the cryogenic-to-room-temperature boundary. A discussion of the potential physical technologies that could enable this (e.g., superconducting SFQ logic, high-bandwidth optical links) would ground the architectural concept in physical reality.
- The Role of Error Correction: The paper focuses on accelerating NISQ-era algorithms. However, the long-term future of quantum computing is in fault-tolerant systems, which will require massive amounts of real-time classical processing for decoding quantum error correction codes. A discussion of how the Qtenon architecture could be adapted or scaled to meet the even more demanding latency requirements of fault-tolerant decoding would be a fascinating extension.
- Standardization and the Ecosystem: For a tightly coupled system like Qtenon to succeed, there needs to be standardization at the hardware-software interface. The paper proposes a custom IR, but it would be beneficial to discuss how this work aligns with emerging industry and academic efforts to standardize quantum intermediate representations (like QIR). A discussion of the path from this research prototype to an open, standardized ecosystem would be valuable.
Questions to Address In Rebuttal
- Your work brilliantly applies lessons from classical heterogeneous computing to the quantum domain. Looking forward, what is the next major lesson from the history of classical architecture that you believe should be applied to the design of quantum computers?
- The unified memory is a powerful abstraction. If you were to start building a physical prototype tomorrow, what specific technologies would you investigate to bridge the cryogenic-to-room-temperature gap, and what do you see as the biggest engineering challenge?
- How do you see the Qtenon architecture evolving to meet the demands of real-time decoding for fault-tolerant quantum error correction, which requires even lower latencies than the hybrid algorithms you evaluated? 🤔
- The success of your system depends on a rich software ecosystem. How does your proposed compiler stack align with broader community efforts like the QIR alliance, and what is the path to creating a standardized interface that would allow different quantum programming languages to target a Qtenon-like architecture?
Review 3
Persona 3: The Innovator (Novelty Specialist)
Review Form
Summary
This paper introduces Qtenon, a new, tightly coupled system architecture for hybrid quantum-classical computing. The core novel claim is the architecture itself: a synthesis of hardware and software components designed to fundamentally reduce the communication latency between a classical host and a quantum processor (QPU). The primary novel hardware components are 1) a unified memory hierarchy that serves as a low-latency communication buffer between the host and the QPU (Section 3.1, Page 3), and 2) a multi-stage, pipelined quantum controller designed for efficient, just-in-time instruction generation (Section 3.2, Page 4). The novel software component is a custom compiler stack that uses a new Quantum Intermediate Representation (QIR) to enable fine-grained instruction scheduling and runtime code generation (Section 4, Page 7).
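As a conceptual aid, the toy model below shows why overlapping the controller's stages (fetch, decode, schedule, JIT pulse generation) hides per-instruction control latency; the stage behavior and cycle accounting are simplifications and do not reproduce the Qtenon controller's microarchitecture.

```python
# Toy model of a multi-stage controller pipeline (fetch -> decode -> schedule
# -> JIT pulse generation). Purely illustrative; this is not the Qtenon
# controller's actual microarchitecture.

from collections import deque

STAGES = ["fetch", "decode", "schedule", "jit"]

def run_pipeline(program):
    """Advance one instruction per stage per cycle until all have left the
    final (JIT) stage, printing pipeline occupancy each cycle."""
    pending = deque(program)
    in_flight = {stage: None for stage in STAGES}
    completed, cycle = 0, 0
    while completed < len(program):
        cycle += 1
        # Shift every instruction one stage downstream, then fetch a new one.
        for i in range(len(STAGES) - 1, 0, -1):
            in_flight[STAGES[i]] = in_flight[STAGES[i - 1]]
        in_flight[STAGES[0]] = pending.popleft() if pending else None
        print(f"cycle {cycle}: " + ", ".join(f"{s}={in_flight[s]}" for s in STAGES))
        if in_flight[STAGES[-1]] is not None:
            completed += 1      # this instruction leaves the JIT stage this cycle
    return cycle

if __name__ == "__main__":
    program = [f"gate_{i}" for i in range(6)]
    cycles = run_pipeline(program)
    print(f"{len(program)} instructions finish in {cycles} cycles "
          f"vs {len(program) * len(STAGES)} if the stages did not overlap")
```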
Strengths
From a novelty standpoint, this paper is a significant contribution because it proposes a fundamentally new system-level architecture for a problem that has previously been addressed with ad-hoc, non-integrated solutions.
- A Novel System-Level Architecture: The most significant "delta" in this work is the shift from a "networked" or "distributed" model of quantum-classical computing to an integrated, shared-memory model. While shared memory is a classic concept, its application to bridge the quantum-classical divide, complete with a detailed proposal for the memory hierarchy, the controller pipeline, and the software stack, is a fundamentally new architectural paradigm for this domain. It is the first paper to present a complete, cohesive blueprint for a truly integrated hybrid system. 🧠
- A New Quantum Controller Design: The proposed multi-stage, pipelined quantum controller is a novel piece of microarchitecture. Prior work has focused on the lower-level pulse generation aspects of quantum control. The Qtenon controller, with its explicit stages for instruction fetching, decoding, scheduling, and JIT compilation (Figure 4, Page 6), is a new, more sophisticated design that elevates the controller from a simple sequencer to a true co-processor.
- Novel Synthesis of Compiler and Hardware: The co-design of the compiler and the hardware is a key novelty. The introduction of a new IR (QIR) and the use of JIT compilation in the controller's pipeline are not just software add-ons; they are integral to the hardware's design and enable its low-latency operation. This tight, synergistic coupling of a JIT compiler with a hardware controller pipeline is a novel approach in the quantum domain.
Weaknesses
While the overall architecture is novel, it is important to contextualize its novelty. The work cleverly applies well-known principles from classical computing to a new domain, but it does not invent fundamentally new low-level primitives.
- Component Concepts are Inspired by Prior Art: The novelty is primarily in the synthesis and application, not in the invention of the base concepts.
- Unified Memory: The idea of a shared memory space for a host and an accelerator is a well-established concept in classical heterogeneous computing (e.g., CUDA's Unified Memory). The novelty here is its application to the unique quantum-classical context.
- Pipelining: Pipelining is one of the most fundamental concepts in computer architecture. The novelty is the specific pipeline designed for the quantum control task, not the idea of pipelining itself.
- LLVM/IR-based Compilation: Using a formal IR and the LLVM toolchain for hardware compilation is a standard practice in the HLS and accelerator design communities. The novelty is the creation of a new IR (QIR) with the specific semantics required for quantum computation.
- The "First" Claim is Specific: The claim to be the "first" tightly coupled architecture is a strong one, but it is specific to the publicly-documented academic literature. The novelty lies in being the first to describe and evaluate such an architecture in detail, not necessarily in being the first to conceive of the idea.
Questions to Address In Rebuttal
- The core of your novelty is the unified memory architecture. Can you contrast your approach with prior work on shared virtual memory for classical heterogeneous systems (e.g., CPU-GPU)? What is the key "delta" or non-obvious challenge that makes unified memory for a quantum system a fundamentally new problem?
- The multi-stage controller pipeline is a key component. What is the most novel aspect of this pipeline compared to the real-time micro-sequencers and controllers used in other domains that require low-latency feedback, such as high-frequency trading or particle accelerator control systems?
- The QIR is central to your software stack. How does your proposed IR differ fundamentally from other emerging quantum intermediate representations, and what novel capabilities does your IR enable that others do not?
- If a competitor were to achieve a similar latency reduction using a different approach (e.g., a highly-optimized, dedicated network-on-chip instead of unified memory), would the novelty of your work be diminished? What is the fundamental, enduring novelty of the "shared memory" approach that makes it superior to a highly-optimized message-passing approach?
HiPER: Hierarchically-Composed Processing for Efficient Robot Learning-Based Control
Abstract
Learning-Based Model Predictive Control (LMPC) is a class of algorithms that enhances Model Predictive Control (MPC) by including machine learning methods, improving robot navigation in complex environments. However, the combination of machine learning ...
Reviews
Review 1
Persona 1: The Guardian (Adversarial Skeptic)
Review Form
Summary
This paper introduces HiPER, a processing array architecture designed to accelerate Learning-Based Model Predictive Control (LMPC) for robotics. The authors identify that LMPC workloads exhibit dynamic, fine-grained parallelism that is poorly suited to conventional CPU-GPU systems. To address this, HiPER proposes a "hierarchically-composed" processing array, a fractal interconnect topology, and a pointer queue hierarchy for program execution. The authors claim this design adapts efficiently to the LMPC workload, demonstrating a 10.75x improvement over an NVIDIA Jetson AGX Orin in a 16nm CMOS synthesis.
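For readers unfamiliar with pointer-based dispatch, a purely hypothetical two-level sketch follows; the queue structure, field names, and routing rule are assumptions for illustration and are not taken from the HiPER design.

```python
# Hypothetical sketch of hierarchical, pointer-style work dispatch. The queue
# structure and names are assumptions for illustration; they do not reproduce
# HiPER's actual pointer queue hierarchy.

from collections import deque
from dataclasses import dataclass, field
from typing import List

@dataclass
class Task:
    name: str
    pe_cluster: int               # index of the PE cluster that should run it

@dataclass
class ClusterQueue:
    tasks: deque = field(default_factory=deque)

def dispatch(top_level: deque, clusters: List[ClusterQueue]) -> None:
    """Drain a top-level queue of task descriptors into per-cluster queues
    (the cluster index acts as the 'pointer'), then let each cluster drain
    its own local queue without further host involvement."""
    while top_level:
        task = top_level.popleft()
        clusters[task.pe_cluster].tasks.append(task)
    for cid, cluster in enumerate(clusters):
        while cluster.tasks:
            task = cluster.tasks.popleft()
            print(f"cluster {cid}: executing {task.name}")

if __name__ == "__main__":
    clusters = [ClusterQueue() for _ in range(2)]
    work = deque([Task("rollout_0", 0), Task("rollout_1", 1), Task("cost_update", 0)])
    dispatch(work, clusters)
```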
Strengths
The paper correctly identifies a clear and relevant problem domain.
- Valid Problem Identification: The central premise is sound: LMPC presents a challenging workload with a mix of parallelizable (ML inference) and sequential (MPC optimization) phases, and the communication and control overhead between a CPU and GPU can be a significant bottleneck for this type of real-time application (Section 1, Page 1).
Weaknesses
The paper's conclusions are fundamentally undermined by a flawed and inequitable baseline comparison, an over-simplification of the workload, and an incomplete analysis of critical system overheads.
- Fundamentally Unsound Baseline Comparison: The headline 10.75x speedup claim is invalid because the comparison between a custom, application-specific ASIC (HiPER) and a general-purpose, off-the-shelf mobile SoC (NVIDIA Jetson AGX Orin) is a classic apples-to-oranges fallacy. An ASIC will almost always outperform a general-purpose processor on its target workload. A rigorous and fair comparison would require evaluating HiPER against another custom accelerator designed for a similar class of problems or, at a minimum, against a highly optimized implementation on the GPU baseline that uses advanced features like CUDA graphs to minimize launch overhead. The reported speedup is an artifact of specialization, not a demonstrated architectural superiority over a comparable solution.
- Workload Representation is Oversimplified: The evaluation is based on a single LMPC workload for a specific drone navigation task (Section 5, Page 7). This is an insufficient sample to prove that the architecture is generally effective for the broad class of LMPC algorithms. Real-world robotics applications involve far more complex sensor processing, state estimation, and planning tasks that are not represented in this simple workload. The HiPER architecture, with its fine-grained PEs, is likely to be highly inefficient for the coarse-grained, memory-intensive tasks that were conveniently excluded from the evaluation.
- Critical Overheads are Ignored: The paper's performance analysis focuses on the execution time within the PE array but fails to properly account for critical system overheads. The pointer queue hierarchy is presented as a solution for control flow, but the latency and energy cost of traversing this multi-level queue structure to dispatch work are not adequately modeled. Furthermore, the paper provides no analysis of the off-chip memory bandwidth requirements. It is unclear if the system would be compute-bound, as the paper assumes, or if it would be severely bottlenecked by DRAM access in a real implementation with large ML models and complex environments.
- Fractal Interconnect Justification is Weak: The paper proposes a "fractal interconnect" but provides insufficient evidence that this specific, complex topology is meaningfully superior to a standard, well-understood 2D mesh or a flattened butterfly network. The claim that it "efficiently supports the workload's traffic characteristics" (Abstract, Page 1) is not backed by a detailed traffic analysis or a comparison to alternative topologies under the same workload. The choice appears to be a novelty for its own sake rather than a rigorously justified engineering decision.
Questions to Address In Rebuttal
- To provide a fair comparison, how does HiPER perform against a state-of-the-art, highly-optimized implementation on the Jetson AGX Orin baseline that uses techniques like CUDA graphs and expert-tuned kernels to minimize the software overhead you identify as the primary bottleneck?
- Provide a detailed analysis of the performance and energy overhead of the pointer queue hierarchy. What is the end-to-end latency, from the host writing a command to a PE beginning execution, and how does this scale as the hierarchy deepens?
- Please provide a sensitivity analysis showing how HiPER's performance degrades as the complexity of the non-LMPC parts of a realistic robotics pipeline (e.g., sensor fusion, SLAM) are included in the workload.
- To justify your choice of a fractal interconnect, please provide a direct, quantitative comparison of its performance (latency, throughput) and cost (area, power) against a conventional 2D mesh and a flattened butterfly network for the exact same LMPC workload.
Review 2
Persona 2: The Synthesizer (Contextual Analyst)
Review Form
Summary
This paper introduces HiPER, a novel accelerator architecture explicitly co-designed for the unique computational demands of Learning-Based Model Predictive Control (LMPC) in robotics. The core contribution is a "hierarchically-composed" architecture that mirrors the hierarchical nature of the LMPC algorithm itself. This is achieved through a reconfigurable array of processing elements (PEs), a novel fractal interconnect that provides both local and global communication paths, and a pointer-based queueing system for low-overhead task dispatch. By creating a hardware architecture that is a direct physical manifestation of the target algorithm's structure, HiPER aims to eliminate the software and communication overheads that plague traditional CPU-GPU solutions, enabling a new level of performance and efficiency for autonomous robot control.
Strengths
This paper is a significant and forward-looking contribution that sits at the cutting edge of robotics, machine learning, and computer architecture. Its primary strength is its deep, system-level understanding of the target workload and its creation of a truly domain-specific, co-designed solution.
- A Brilliant Example of Algorithm/Hardware Co-design: The most significant contribution of this work is its textbook execution of algorithm/hardware co-design. The authors have not simply accelerated a single kernel; they have analyzed a complete, complex application (LMPC) and designed a hardware architecture that is a direct, physical embodiment of the algorithm's structure (Section 2, Page 2; Section 3, Page 3). The hierarchical composition of the processing array, the fractal nature of the interconnect, and the pointer-based control flow are all direct responses to the specific needs of the LMPC workload. This is a masterclass in domain-specific architecture. 🤖
- Enabling the Future of Robotics: The practical impact of this work could be immense. LMPC is a powerful technique that is central to the next generation of intelligent, autonomous robots. However, its real-world deployment has been severely limited by its computational cost. By providing an order-of-magnitude improvement in performance and efficiency (Figure 10, Page 9), HiPER could make advanced LMPC practical for a wide range of real-time, power-constrained applications, from autonomous drones and self-driving cars to surgical robots. This work could be a key enabler for the future of robotics.
- Connecting Disparate Architectural Concepts: HiPER is a beautiful synthesis of ideas from different domains of computer architecture. It combines the fine-grained parallelism of systolic arrays (for local dataflow) with the coarse-grained control of a VLIW or dataflow processor (via the pointer queue hierarchy), and it connects them with an interconnect inspired by fractal geometry. This creative combination of established principles to create a new, powerful whole is a hallmark of innovative architectural thinking.
Weaknesses
While the core design is powerful, the paper could be strengthened by broadening its focus to the software ecosystem and the long-term evolution of the architecture.
- The Programmability Challenge: The HiPER architecture is highly specialized. A key challenge, which is not fully explored, is how a developer would actually program it. The paper describes the pointer queue mechanism but does not detail the high-level programming model or the compiler toolchain that would be required to map a complex LMPC application onto the hardware. A discussion of the software and compiler challenges is critical for assessing the practical usability of the architecture.
- Beyond LMPC: The architecture is exquisitely tailored to LMPC. However, it is possible that the core principles of hierarchical composition and fractal interconnects could be beneficial for other important workloads with similar characteristics. A discussion of how the HiPER architecture could be generalized or adapted to other domains, such as graph analytics, scientific computing, or other forms of reinforcement learning, would broaden the impact of the work.
- The Memory System: The paper focuses on the processing array and interconnect but is relatively light on the details of the off-chip memory system. For more complex robotics applications with large ML models and high-dimensional state spaces, the interface to external DRAM will be a critical performance bottleneck. A more detailed analysis of the memory system's design and its interaction with the PE array would be a valuable addition.
Questions to Address In Rebuttal
- Your work is a fantastic example of co-design. Looking forward, how do you envision the programming model for HiPER? What new language abstractions or compiler techniques would be needed to allow a robotics developer to productively map their algorithms onto your architecture?
- The fractal interconnect is a beautiful match for the LMPC workload. What other application domains do you think could benefit from this type of hierarchically-aware interconnect topology? 🤔
- How would the HiPER architecture need to evolve to efficiently handle the massive, irregular memory access patterns of other critical robotics tasks, such as running a large Vision Transformer for perception or performing a graph-based SLAM optimization?
- If a new, non-hierarchical learning-based control algorithm were to emerge in the future, how adaptable is the HiPER architecture? Does its strength in LMPC come at the cost of being overly specialized?
Review 3
Persona 3: The Innovator (Novelty Specialist)
Review Form
Summary
This paper introduces HiPER, a new accelerator architecture for Learning-Based Model Predictive Control (LMPC). The core novel claims are the synthesis of three primary architectural features: 1) A "hierarchically-composed" processing array that can be dynamically configured to match the structure of the LMPC algorithm (Section 3.2, Page 3); 2) A pointer queue hierarchy for low-overhead, distributed control and orchestration of the processing elements (Section 3.3, Page 4); and 3) A fractal interconnect topology that provides both high-bandwidth local connections and efficient long-range global connections (Section 3.4, Page 5). The synergistic combination of these three features to create a new, workload-adaptive architecture is presented as the primary novel contribution.
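To ground the interconnect discussion, the toy comparison below contrasts hop counts on a binary tree (treating the fractal H-tree as a logical tree of routers, which is a simplification) with a 2D mesh of the same size; it illustrates the local versus global path-length trade-off only and does not model HiPER's actual NoC, link widths, or contention.

```python
# Toy hop-count comparison: tree-style (H-tree layout) network vs. 2D mesh.
# Geometric illustration only; it does not model HiPER's actual interconnect.

import math

def tree_hops(a, b):
    """Hops between two leaves of a complete binary tree: climb to the lowest
    common ancestor, then descend."""
    hops = 0
    while a != b:
        a //= 2
        b //= 2
        hops += 2
    return hops

def mesh_hops(a, b, side):
    """Manhattan distance between PEs a and b on a side x side mesh."""
    ax, ay = a % side, a // side
    bx, by = b % side, b // side
    return abs(ax - bx) + abs(ay - by)

if __name__ == "__main__":
    n, side = 256, 16                      # 256 PEs in both topologies
    for label, (a, b) in [("near pair", (0, 1)), ("far pair", (0, n - 1))]:
        print(f"{label}: tree {tree_hops(a, b)} hops, mesh {mesh_hops(a, b, side)} hops")
    print(f"diameter: tree {2 * int(math.log2(n))} hops, mesh {2 * (side - 1)} hops")
```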
Strengths
From a novelty standpoint, this paper is a significant contribution because it proposes a fundamentally new, holistic system architecture built from a collection of individually novel and clever components.
- A Novel Architectural Paradigm: The most significant "delta" in this work is the concept of a "hierarchically-composed" architecture. While reconfigurable and array-based processors are known, HiPER is the first to propose an architecture where the composition and orchestration of the processing elements is itself hierarchical and is designed to be a direct, physical mirror of the target algorithm's structure. This is a new and powerful paradigm for domain-specific acceleration that moves beyond simple kernel offload to a more deeply co-designed system. 🧠
- A Novel Control and Orchestration Mechanism: The pointer queue hierarchy is a novel and elegant solution to the problem of low-overhead control in a massively parallel array. While dataflow and VLIW architectures have explored distributed control, HiPER's use of a multi-level hierarchy of pointer-based FIFOs is a new and specific mechanism that is a clean fit for the target LMPC workload. It represents a new point in the design space between centralized, high-overhead control (like a traditional CPU) and purely data-driven execution.
- A Novel Interconnect Topology: The application of a fractal H-tree topology as the basis for a reconfigurable NoC is a novel and non-obvious choice for an accelerator interconnect. While H-trees are used for clock distribution, their use as a general-purpose data network, particularly in a hierarchical and reconfigurable manner, has not been explored in prior art. It is a clever solution that directly addresses the mix of local and global communication required by the workload.
Weaknesses
While the core architecture is highly novel, it is important to contextualize its novelty. The work is a new synthesis of many ideas, and its direct applicability may be limited to its specific target domain.
- Component Concepts Have Precedents: While the synthesis is new, the underlying ideas have conceptual roots in prior work. Hierarchical processing can be seen as an evolution of coarse-grained reconfigurable arrays. Pointer-based control has echoes of dataflow and transport-triggered architectures. The novelty is not in the invention of these base concepts from first principles, but in their clever adaptation and synergistic combination to solve a new, domain-specific problem.
- High Degree of Specialization: The novelty of the architecture is inextricably linked to its high degree of specialization for the LMPC workload. The tight coupling of the architecture to the algorithm is its primary strength, but it also limits the scope of its novelty. It is a new architecture for LMPC, but it is not a fundamentally new general-purpose architectural paradigm.
- Performance Gains Are a Consequence of Novelty: The reported speedups are a direct and expected consequence of creating a novel, application-specific architecture. It is not a novel discovery that a custom ASIC is faster than a general-purpose GPU. The novelty is in the creation of the architecture that enables this speedup, not in the speedup itself.
Questions to Address In Rebuttal
- The core of your novelty is the "hierarchically-composed" architecture. Can you contrast your approach with prior work on coarse-grained reconfigurable architectures (CGRAs)? What is the key "delta" that makes your dynamic composition mechanism a fundamentally new approach?
- The pointer queue hierarchy is a novel control mechanism. How is this fundamentally different from the token-based firing rules in classical dataflow architectures? What new capability does the pointer-based approach enable?
- The use of a fractal H-tree for a data network is a novel choice. What is the most non-obvious advantage of this topology for the LMPC workload that would not be provided by a more conventional flattened butterfly or a concentrated mesh network?
- If a competitor were to design a different LMPC accelerator using a different (but still novel) set of architectural principles, on what fundamental, enduring novelty would HiPER compete? Does the novelty lie in the specific components (fractal interconnect, pointer queues), or in the more general philosophy of hierarchical co-design?
Dadu-Corki: Algorithm-Architecture Co-Design for Embodied AI-powered Robotic Manipulation
Abstract
Embodied AI robots have the potential to fundamentally improve the way human beings live and manufacture. Continued progress in the burgeoning field of using large language models to control robots depends critically on an efficient computing substrate, ...
Reviews
Review 1
Persona 1: The Guardian (Adversarial Skeptic)
Review Form
Summary
This paper introduces Dadu-Corki, a supposed algorithm-architecture co-design for robotic manipulation. The authors propose a new algorithm, CORKI, which is a variant of the existing RoboFlamingo model, and a new hardware accelerator, DADU, designed to run it. The core idea is that CORKI is modified to better suit hardware acceleration by, for example, using a fixed number of reasoning steps. The authors claim this co-design approach yields significant performance and energy efficiency gains over baseline GPU implementations.
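To illustrate the kind of hardware-friendliness change under discussion, the generic sketch below contrasts a data-dependent refinement loop with a fixed-step loop that a static pipeline can schedule; the function names, step count, and convergence test are hypothetical and are not taken from CORKI.

```python
# Generic illustration of why a fixed iteration count is hardware-friendly.
# Names, the step count, and the convergence test are hypothetical; this is
# not the CORKI algorithm.

def refine_action_dynamic(state, step_fn, converged, max_steps=16):
    """Data-dependent loop: the iteration count varies per input, so a hardware
    schedule cannot be fixed at compile time."""
    for i in range(max_steps):
        state = step_fn(state)
        if converged(state):
            return state, i + 1
    return state, max_steps

def refine_action_fixed(state, step_fn, num_steps=3):
    """Fixed-step loop (e.g. always 3 refinement steps): the dataflow graph is
    static, so buffers and pipeline stages can be sized at design time."""
    for _ in range(num_steps):
        state = step_fn(state)
    return state

if __name__ == "__main__":
    step = lambda x: x * 0.5                 # toy refinement: halve the error
    done = lambda x: abs(x) < 0.1
    print(refine_action_dynamic(1.0, step, done))   # -> (0.0625, 4)
    print(refine_action_fixed(1.0, step))           # -> 0.125
```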
Strengths
The paper correctly identifies a relevant and challenging problem.
- Valid Problem Identification: The central premise is sound: vision-language models (VLMs) for robotic control are computationally expensive, and the latency of executing them on general-purpose hardware like GPUs is a major bottleneck for real-time manipulation (Section 1, Page 1).
Weaknesses
The paper's claims of a successful "co-design" are fundamentally undermined by a flawed evaluation, questionable algorithmic contributions, and an incomplete architectural analysis. The work is a classic example of designing a custom accelerator for a slightly modified algorithm and then claiming the result is a profound co-design.
- Fundamentally Unsound Baseline Comparison: The headline performance claims (e.g., 6.4x speedup, 13.9x energy savings) are invalid because they are the result of an apples-to-oranges comparison. An application-specific ASIC (DADU) is being compared to a general-purpose GPU (NVIDIA RTX 4090) (Section 5.3, Page 8). An ASIC will always be more efficient for its target workload. This is not a fair or meaningful comparison. A rigorous evaluation would require comparing DADU against other, comparable robotic accelerators or proving that the algorithmic changes alone account for the benefits.
- Algorithmic Contribution is Trivial and Unproven: The CORKI algorithm is presented as a key part of the co-design, but its novelty is minimal. It is a minor variant of the existing RoboFlamingo, with changes made for hardware convenience (e.g., fixed reasoning steps) rather than fundamental algorithmic improvement. Crucially, the paper provides insufficient evidence that CORKI is actually a better algorithm. The evaluation (Table 1, Page 10) shows that CORKI-3 has a longer average task sequence length than the baseline RoboFlamingo, suggesting it is a less efficient and possibly less intelligent policy. The claim of co-designing a better algorithm is unsubstantiated.
- Critical Overheads are Ignored: The paper's analysis focuses on the core VLM computation but appears to ignore or minimize other critical system overheads. There is no detailed analysis of the latency or energy cost of the off-chip memory accesses, which would be significant for a model of this size. Furthermore, the paper provides no analysis of the full system stack, including the low-level robot control and sensor processing that would run alongside the VLM. The claimed end-to-end performance benefits are based on an incomplete and oversimplified view of the full robotics pipeline.
- "Co-design" is Asserted, Not Demonstrated: The paper repeatedly uses the term "co-design," but there is no evidence of a genuine feedback loop between the algorithm and the architecture. It appears the authors simply took an existing algorithm, made minor modifications to make it more hardware-friendly, and then built a standard accelerator for it. A true co-design would involve using architectural insights to drive fundamental new algorithmic developments, or vice-versa. This is not demonstrated here.
Questions to Address In Rebuttal
- To provide a fair comparison, how does the DADU accelerator perform against other, published ASIC accelerators for robotic or VLM workloads, when normalized for technology node and silicon area?
- Your own results (Table 1, Page 10) show that your CORKI algorithm requires a longer sequence of steps to complete tasks than the RoboFlamingo baseline. How can you claim this is a superior algorithm when it appears to be less efficient at the task level?
- Please provide a detailed breakdown of the off-chip DRAM traffic generated by your system. What percentage of the total execution time and energy is spent on DRAM access, and how does this compare to the on-chip computation?
- Can you provide a concrete example of how a specific hardware design choice in DADU led to a fundamental, non-obvious change in the CORKI algorithm (or vice-versa)? This is necessary to prove that this is a true co-design, not just an algorithm port to a custom chip.
Review 2
Persona 2: The Synthesizer (Contextual Analyst)
Review Form
Summary
This paper introduces Dadu-Corki, a holistic, co-designed system that pairs a new robotic manipulation algorithm (CORKI) with a new, purpose-built hardware accelerator (DADU). The central contribution is the tight, synergistic coupling of the algorithm and the hardware. The CORKI algorithm, a variant of the powerful RoboFlamingo vision-language model, is optimized for hardware-friendliness, while the DADU architecture is a domain-specific accelerator with features explicitly designed to accelerate CORKI's unique computational patterns, such as its mix of attention and cross-attention mechanisms. This work aims to provide a complete, full-stack solution that bridges the gap between high-level AI-based robotics and low-level silicon design, enabling a new level of performance and efficiency for embodied AI.
Strengths
This paper is a significant and forward-looking contribution that exemplifies the future of high-performance robotics. Its strength lies in its deep, full-stack understanding of the problem and its creation of a complete, end-to-end solution.
- A Brilliant Example of Algorithm/Hardware Co-design: The most significant contribution of this work is its textbook execution of algorithm/hardware co-design. The authors have not simply accelerated an existing piece of software; they have thoughtfully modified the CORKI algorithm to be more amenable to hardware acceleration and, in parallel, designed the DADU hardware to perfectly match the algorithm's needs (Section 2, Page 2). This virtuous cycle, where algorithmic insights inform hardware design and hardware constraints inform algorithmic choices, is the essence of true co-design and is a model for future work in the field. 🤖
- Enabling the Future of Embodied AI: The practical impact of this work could be immense. The dream of "Embodied AI"—robots that can understand and interact with the world with human-like intelligence—is currently limited by the immense computational cost of the underlying AI models. By providing a solution that is an order of magnitude more performant and efficient than current GPU-based systems (Figure 6, Page 9), Dadu-Corki could be a key enabler for making this dream a practical reality. It helps pave the way for real-time, power-efficient, and truly intelligent robotic systems.
- Connecting the Full System Stack: Dadu-Corki is a beautiful synthesis of ideas from across the entire computing stack. It combines insights from high-level AI and robotics (the VLM-based control policy), compiler and software engineering (the custom instruction set and mapping), and low-level digital logic design (the dataflow accelerator architecture). This ability to reason about and optimize the problem from the application all the way down to the silicon is a hallmark of innovative, impactful systems research.
Weaknesses
While the core design is powerful, the paper could be strengthened by broadening its focus to the programmability and long-term evolution of the architecture.
- The Programmability Challenge: The DADU accelerator is highly specialized for the CORKI algorithm. A key challenge, which is not fully explored, is how a robotics developer would program or adapt it for a different VLM or a new manipulation task. The paper mentions an algorithm framework (Section 4.3, Page 7), but a deeper discussion of the programming model and the compiler toolchain would be critical for assessing the long-term viability and flexibility of the architecture.
- Beyond Manipulation: The paper focuses on a specific class of robotic manipulation tasks. However, a real-world autonomous robot needs to do much more, including perception, localization (SLAM), and path planning. A discussion of how the DADU accelerator could be integrated into a larger, heterogeneous system-on-chip that also accelerates these other critical robotics tasks would provide a more complete picture of a full system solution.
- The Pace of VLM Research: The paper co-designs an accelerator for today's VLM architectures. However, the field of large AI models is evolving at a breathtaking pace. A discussion of how the DADU architecture could be adapted or future-proofed to handle the next generation of VLM models, which might have different structures (e.g., Mixture-of-Experts, new attention mechanisms), would be a valuable addition.
Questions to Address In Rebuttal
- Your work is a fantastic example of co-design. Looking forward, how do you envision the programming model for DADU? What new language abstractions or compiler techniques would be needed to allow a robotics developer to easily map a new or different VLM onto your architecture?
- How do you see the DADU accelerator being integrated into a full, heterogeneous SoC for a real-world robot? Would it be a co-processor on a larger chip that also includes accelerators for perception and planning?
- The CORKI algorithm is based on the Flamingo architecture. How would your DADU architecture need to evolve to efficiently handle a future VLM that uses a fundamentally different structure, such as a Mixture-of-Experts (MoE) model? 🤔
- This work pushes intelligence closer to the robot's "muscles." What do you think is the next major bottleneck in robotics that needs to be solved through a similar, full-stack, algorithm/hardware co-design approach?
Review 3
Persona 3: The Innovator (Novelty Specialist)
Review Form
Summary
This paper introduces Dadu-Corki, a new system for robotic manipulation. The core novel claims are the synergistic co-design of two new components: 1) CORKI, a new robotic control algorithm that modifies the existing RoboFlamingo VLM architecture for hardware efficiency by, for example, using a fixed number of reasoning steps and a simplified action space (Section 2, Page 2). 2) DADU, a new, domain-specific dataflow accelerator architecture explicitly designed to accelerate the CORKI algorithm (Section 3, Page 4). The primary novel feature of the DADU architecture is its heterogeneous design, with specialized processing units and data paths for the different components of the CORKI model (e.g., the Perceiver, the Cross-Attention, and the Gated-Recurrent Unit).
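To make the claimed hardware-friendliness concrete, the sketch below contrasts an input-dependent reasoning loop with a fixed-iteration one. All types and function names are hypothetical placeholders (the paper's code is not reproduced here); the only point illustrated is that a compile-time-bounded loop gives an accelerator a static, fully pipelineable schedule, whereas data-dependent termination does not.

```cpp
#include <cstddef>

// Hypothetical placeholder types standing in for CORKI's latent and action tensors.
struct Latent { float features[512]; };
struct Action { float dx, dy, dz, gripper; };

// Trivial stand-ins for the VLM kernels; in the real system these would be the
// Perceiver / cross-attention / GRU stages the reviews mention.
Latent fuse_vision_language(const Latent& prev) { return prev; }
Action decode_action(const Latent&)             { return {0.0f, 0.0f, 0.0f, 0.0f}; }
bool   policy_converged(const Latent&)          { return true; }

// Baseline-style control: the iteration count depends on the input, so an
// accelerator cannot statically schedule or fully pipeline the loop body.
Action infer_adaptive(Latent state) {
    while (!policy_converged(state)) {
        state = fuse_vision_language(state);
    }
    return decode_action(state);
}

// CORKI-style control (as characterized in the reviews): reasoning depth is
// fixed at compile time, so every invocation has identical latency and the
// loop can be unrolled into a static hardware pipeline.
constexpr std::size_t kFixedSteps = 3;  // hypothetical; mirrors "CORKI-3"

Action infer_fixed(Latent state) {
    for (std::size_t i = 0; i < kFixedSteps; ++i) {
        state = fuse_vision_language(state);
    }
    return decode_action(state);
}
```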
Strengths
From a novelty standpoint, this paper's strength lies in its holistic, full-stack approach to innovation. It does not just propose a new algorithm or a new piece of hardware in isolation; it proposes a new, tightly-coupled system where both components are new and are designed to work together.
- A Novel System-Level Co-design: The most significant "delta" in this work is the methodology itself: the tight, synergistic co-design of a new algorithm (CORKI) and a new hardware architecture (DADU) in a closed loop. While algorithm/hardware co-design is a known concept, this paper is a rare example of a work that presents novel contributions at both the algorithm and the architecture level within a single, cohesive project. The innovation is not just in CORKI or DADU, but in the Dadu-Corki system as a whole. 💡
- A Novel Algorithm for Hardware Acceleration: While CORKI is an evolution of RoboFlamingo, its specific modifications—such as the fixed-iteration reasoning and the simplified action decoder—are novel algorithmic changes made explicitly for the purpose of enabling efficient hardware acceleration (Section 2.3, Page 4). This represents a new and important design point in the space of VLM-based control, prioritizing hardware-friendliness alongside task performance.
- A Novel, Heterogeneous Dataflow Architecture: The DADU accelerator is a novel architecture. While dataflow accelerators are known, DADU's specific, heterogeneous pipeline—with dedicated hardware units and memory subsystems for the different phases of the CORKI algorithm (Figure 3, Page 6)—is a new and domain-specific design. It is not a generic VLM accelerator; it is a CORKI accelerator, and this specialization is a key part of its novelty.
Weaknesses
While the overall system is novel, it is important to contextualize the novelty of the individual components, which are largely clever adaptations of existing ideas.
- Algorithmic Novelty is Evolutionary, Not Revolutionary: The CORKI algorithm is a clear and direct descendant of the Flamingo family of models. It does not propose a fundamentally new way of performing vision-language reasoning. Its novelty is in the specific, pragmatic trade-offs it makes to improve hardware efficiency, which is a significant engineering contribution but not a revolutionary algorithmic breakthrough.
- Architectural Primitives are Known: The DADU architecture, while novel in its overall composition, is built from well-understood architectural primitives. It is a dataflow accelerator that uses standard components like systolic arrays, memory controllers, and specialized functional units. The novelty is not in the invention of a new type of processing element or memory system, but in the specific arrangement and specialization of these known components to create a new, application-specific pipeline.
- Performance Gains are a Consequence of Novelty: The reported performance and efficiency gains are a direct and expected consequence of the novel, application-specific co-design. It is not a novel discovery that a custom ASIC is more efficient than a general-purpose GPU. The novelty is in the creation of the co-designed system that enables these gains, not in the gains themselves.
Questions to Address In Rebuttal
- The core of your novelty is the co-design. Can you provide a specific example where a limitation in the DADU hardware design forced you to make a non-obvious, novel change to the CORKI algorithm that you would not have otherwise made?
- The CORKI algorithm is an adaptation of RoboFlamingo. What is the most significant, novel insight you gained about the structure of VLMs for robotics by modifying the algorithm for hardware, an insight that would be valuable even for researchers who are not building custom hardware?
- The DADU architecture is a heterogeneous dataflow design. Can you contrast your approach with prior work on other heterogeneous, domain-specific accelerators (e.g., for wireless baseband or signal processing)? What is the key "delta" that makes your architecture a fundamentally new approach?
- If a new, superior VLM architecture for robotics were to be published tomorrow, which part of the Dadu-Corki system's novelty would be more enduring: the specific design of the DADU accelerator, or the more general methodology of algorithm-hardware co-design that you have demonstrated?
RTSpMSpM: Harnessing Ray Tracing for Efficient Sparse Matrix Computations
Abstract
The significance of sparse matrix algebra drives the development of sparse matrix accelerators. Despite the general acceptance of hardware accelerators as a means to meet application demands, and the promise of substantial performance gains, integrating ...
Reviews
Review 1
Persona 1: The Guardian (Adversarial Skeptic)
Review Form
Summary
This paper proposes RTSPMSpM, a method for accelerating Sparse Matrix-Matrix Multiplication (SpMSpM) by offloading the computation to the dedicated Ray Tracing hardware (RT Cores) found on modern GPUs. The core idea is to represent the sparse matrices as geometric scenes and then use the hardware-accelerated Bounding Volume Hierarchy (BVH) traversal to find the overlapping non-zero elements required for the multiplication. The authors claim this approach provides a 1.85x speedup over the state-of-the-art cuSPARSE library.
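For readers unfamiliar with the mapping, the sketch below models one plausible encoding in plain C++: each non-zero of A becomes a thin axis-aligned box positioned at its column index, each non-zero of B issues a point query at its row index, and a brute-force overlap test stands in for the hardware BVH traversal. This is a simplified software stand-in under assumed conventions, not the paper's actual scheme or the RT-core API.

```cpp
#include <cstdio>
#include <vector>

// One non-zero of a sparse matrix in COO form.
struct NonZero { int row, col; float val; };

// A thin axis-aligned box: the geometric stand-in for a non-zero of A,
// placed on a 1-D "k axis" at its column index (assumed encoding).
struct AABB { float lo, hi; int row; float val; };

int main() {
    // A (2x3) and B (3x2) in COO form.
    std::vector<NonZero> A = {{0, 0, 1.0f}, {0, 2, 2.0f}, {1, 1, 3.0f}};
    std::vector<NonZero> B = {{0, 1, 4.0f}, {2, 0, 5.0f}, {1, 1, 6.0f}};

    // Step 1: turn each non-zero A[i][k] into a thin box around k.
    // In the RT-core version these boxes would be fed to the BVH builder.
    std::vector<AABB> scene;
    for (const auto& a : A)
        scene.push_back({a.col - 0.25f, a.col + 0.25f, a.row, a.val});

    // Step 2: for each non-zero B[k][j], issue a point query at k.
    // A brute-force overlap test stands in for hardware BVH traversal.
    float C[2][2] = {};
    for (const auto& b : B) {
        float q = static_cast<float>(b.row);          // query position = k
        for (const auto& box : scene) {
            if (q >= box.lo && q <= box.hi)           // "ray-box hit" => index match
                C[box.row][b.col] += box.val * b.val; // accumulate partial product
        }
    }

    for (int i = 0; i < 2; ++i)
        std::printf("C[%d] = [%6.2f %6.2f]\n", i, C[i][0], C[i][1]);
    return 0;
}
```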
Strengths
The paper correctly identifies a potentially underutilized, powerful hardware resource on modern GPUs.
- Valid Problem Identification: The work is motivated by the desire to accelerate SpMSpM, which is a critical kernel in many scientific and machine learning applications. Seeking new hardware acceleration pathways for this problem is a valid research direction.
Weaknesses
The paper's conclusions are founded on a fundamentally flawed analogy, an incomplete performance analysis that ignores critical overheads, and an inequitable baseline comparison.
- Fundamentally Flawed Architectural Analogy: The entire premise rests on a superficial similarity between SpMSpM and ray tracing. SpMSpM is at its core an irregular-memory-access problem whose index matching is simple integer comparison; ray tracing hardware is a highly specialized engine built for floating-point geometric intersection tests and spatial data-structure traversal. Using that engine to perform what is essentially a database join on integer coordinates is a gross architectural mismatch: expensive ray-box intersection tests are spent doing work that a handful of integer comparisons would accomplish.
- Critical BVH Construction Overhead is Ignored: The proposed method requires a costly pre-processing step to convert the sparse matrices into a BVH data structure (Section 3.2, Page 4). This BVH construction is not a "free" operation; it is a computationally intensive task that consumes significant time and memory. The paper's evaluation (Section 4.4, Page 6) appears to completely exclude the time taken for this construction from its end-to-end performance measurements. By only reporting the time for the traversal and merge steps, the reported speedups are artificially inflated and deeply misleading. The BVH construction time could easily negate any performance benefit from the hardware acceleration.
- Unfair Baseline Comparison: The 1.85x speedup claim is based on a comparison against a general-purpose library (cuSPARSE) running on standard CUDA cores. The RT Core is a massive, power-hungry, fixed-function ASIC. A rigorous and fair comparison would require evaluating RTSPMSpM against a baseline where the standard CUDA cores are given an equivalent silicon area and power budget to the RT Core. It is highly probable that a larger array of CUDA cores or a more specialized SpMSpM unit built with the same resources would outperform this inefficient repurposing of a ray tracer.
- Scalability is Unproven and Likely Poor: The evaluation is performed on a selection of matrices from the SuiteSparse collection (Section 4.1, Page 5). There is no analysis of how the performance, and particularly the BVH construction time, scales as the matrix dimensions and number of non-zero elements grow to the massive sizes seen in modern large-scale graph analytics or scientific simulations. BVH construction for large, unstructured point clouds (analogous to a sparse matrix) is a notoriously difficult and non-linear problem. The claim of general applicability is unsubstantiated.
Questions to Address In Rebuttal
- To provide a fair and complete analysis, please report the end-to-end execution time of your method, including the BVH construction phase. How does the total time, including pre-processing, compare to the cuSPARSE baseline?
- Please provide a sensitivity analysis showing how the BVH construction time scales with the matrix size (N) and the number of non-zero elements (NNZ). At what point does this pre-processing overhead overwhelm the benefits of the hardware traversal?
- To provide a fair comparison, how does your approach compare to a baseline where the SpMSpM kernel is run on a hypothetical GPU with no RT Core, but where the CUDA cores are given an additional silicon area and power budget equivalent to that of the RT Core?
- Can you justify the fundamental architectural choice of using a floating-point geometric intersection engine for an integer-based sparse linear algebra problem? What is the raw, cycle-for-cycle efficiency (in terms of useful multiply-accumulates per second per mm²) of your approach compared to a standard CUDA core implementation?
Review 2
Persona 2: The Synthesizer (Contextual Analyst)
Review Form
Summary
This paper, "RTSPMSpM," introduces a novel and creative approach to accelerating Sparse Matrix-Matrix Multiplication (SpMSpM) by harnessing the dedicated Ray Tracing (RT) hardware that is now a standard component of modern GPUs. The core contribution is a new algorithm that cleverly maps the abstract problem of SpMSpM onto the geometric domain of ray tracing. By representing the non-zero elements of the input matrices as geometric objects (Axis-Aligned Bounding Boxes) and building a Bounding Volume Hierarchy (BVH) over them, the authors are able to use the massively parallel, hardware-accelerated intersection testing capabilities of RT Cores to rapidly identify the overlapping terms required for the matrix product. This work opens up a new and unexpected pathway for accelerating sparse linear algebra by repurposing a powerful, ubiquitous, and often underutilized hardware resource.
Strengths
This paper's primary strength is its brilliant "out-of-the-box" thinking, which connects two seemingly unrelated computational domains to create a new and unexpected synergy. This is a work of great creativity and insight.
- A Brilliant Repurposing of Specialized Hardware: The most significant contribution of this work is its clever and non-obvious idea to use a ray tracing engine for sparse linear algebra (Section 1, Page 2). Modern GPUs are becoming collections of specialized accelerators, and a key architectural challenge is avoiding "dark silicon." This paper provides a compelling solution by finding a "secondary purpose" for the RT Core, effectively democratizing a specialized graphics unit for a general-purpose scientific computing task. This is a powerful and important direction for heterogeneous computing. 💡
- Connecting Disparate Computational Worlds: This work serves as an intellectual bridge between the worlds of computer graphics and high-performance scientific computing. It recognizes a deep algorithmic similarity between finding ray-object intersections in a 3D scene and finding row-column overlaps in a matrix multiplication. By showing how the SpMSpM problem can be formally reframed in the language of geometry (Section 2, Page 2), the paper provides a crucial Rosetta Stone that allows the scientific computing community to tap into the immense, highly-optimized hardware resources of the graphics world.
- Enabling a New Class of General-Purpose Ray Tracing: This paper is a pioneering example of what could be called "General-Purpose Computing on Ray Tracing Units" (GPRT), a spiritual successor to GPGPU. It opens up a new field of research focused on finding other non-graphics algorithms that are fundamentally about search, intersection, or traversal, and mapping them onto RT hardware. This could have a significant impact on fields ranging from computational biology to database acceleration.
Weaknesses
While the core idea is brilliant, the paper could be strengthened by broadening its focus from a specific kernel to the larger system and software ecosystem.
- The BVH Construction Bottleneck: The paper focuses on the acceleration of the core intersection phase, but the necessary pre-processing step of building the BVH is a significant undertaking. A more detailed discussion of how to optimize and accelerate this BVH construction phase, perhaps by using the CUDA cores in a pipelined fashion, would provide a more complete picture of a full system solution.
- The Software Abstraction Layer: For this technique to be widely adopted, it needs to be integrated into high-level scientific computing libraries and frameworks (e.g., SciPy, MATLAB, Eigen). A discussion of the software and compiler challenges in building this abstraction layer—which would need to hide the complexity of the geometric mapping from the end user—would be a valuable addition. What does the API for a "ray-tracing-accelerated BLAS" look like?
- Beyond SpMSpM: The paper successfully demonstrates the mapping for SpMSpM. The natural next question is: what other sparse algebra or graph algorithms could be accelerated with this approach? A discussion of how this geometric mapping could be generalized to other important kernels, such as Sparse Matrix-Vector Multiplication (SpMV) or graph traversal, would elevate the work from a clever trick for one problem to a truly general-purpose platform.
Questions to Address In Rebuttal
- Your work brilliantly repurposes the RT Core. Looking forward, what other major computational problems, from any field of science or engineering, do you think are fundamentally "intersection problems" that could be mapped onto ray tracing hardware?
- The BVH construction is a key overhead. How do you envision this being managed in a real-world scientific computing workflow? Should the BVH be constructed once and reused, and what are the trade-offs of this approach?
- For this to have broad impact, it needs a clean software interface. What do you see as the biggest challenge in building a compiler or library that could automatically and transparently offload sparse algebra operations to the RT Core without the user needing to know anything about geometry or ray tracing? 🤔
- If you were designing the next generation of GPUs, knowing about this potential use case, would you still design a separate, highly-specialized RT Core? Or would you create a more general-purpose, programmable "Traversal Engine" that could be used for both graphics and these new GPRT workloads?
Review 3
Persona 3: The Innovator (Novelty Specialist)
Review Form
Summary
This paper introduces RTSPMSpM, a new system for accelerating Sparse Matrix-Matrix Multiplication (SpMSpM). The core novel claim is the algorithmic mapping of the SpMSpM problem onto the hardware-accelerated ray tracing pipeline of modern GPUs. This is achieved through a series of novel techniques: 1) A new problem formulation that represents the non-zero elements of the input matrices as Axis-Aligned Bounding Boxes (AABBs) in a 2D space (Section 2.2, Page 3). 2) The application of the standard Bounding Volume Hierarchy (BVH) data structure to this novel geometric representation of a sparse matrix (Section 3.2, Page 4). 3) The use of hardware-accelerated ray-intersection tests to perform the row-column overlap computation. The synthesis of these steps to repurpose a graphics accelerator for a sparse algebra task is presented as the primary novel contribution.
Strengths
From a novelty standpoint, this paper is a significant contribution because it proposes a fundamentally new and non-obvious cross-domain algorithm. It does not merely optimize an existing method; it creates a new one by drawing an analogy between two previously disconnected fields.
- A Novel Architectural Analogy: The most significant "delta" in this work is the conceptual leap required to see the algorithmic similarity between sparse matrix multiplication and ray tracing. While both can be viewed as search problems, the specific, concrete mapping of matrix indices to geometric coordinates and the use of intersection tests to represent the multiplication is a genuine and highly innovative insight. It is a new way of thinking about the problem that has not been explored in prior art. 🧠
- A New Mechanism for General-Purpose Acceleration: This work is a pioneering example of a new class of general-purpose computation: "GPRT" (General-Purpose computing on Ray Tracing units). While GPGPU is a well-established field, prior work has focused on the programmable CUDA cores. This is the first paper to propose a detailed, end-to-end mapping of a major scientific computing kernel onto the fixed-function ray tracing pipeline. This opens up a new and unexplored avenue for hardware acceleration research.
- Novel Problem Representation: The specific techniques used to enable the mapping are themselves novel. The representation of a sparse matrix as a collection of AABBs, where the geometry of the boxes encodes the row and column indices of the non-zero elements, is a new and clever data representation that is the key enabler for the entire system.
Weaknesses
While the core idea is highly novel, it is important to contextualize its novelty. The work proposes a new use for existing hardware and algorithms, but does not invent new fundamental hardware primitives or data structures.
- Underlying Primitives are Not New: The novelty is purely in the abstraction and the mapping. The work uses standard ray tracing concepts and a standard data structure (the BVH). It does not propose a new type of traversal algorithm or a new hardware unit. The innovation is in the creative application of these existing tools to a new problem domain.
- Novelty is in the "Hack": The solution is, in essence, a very clever "hack." It tricks a piece of hardware designed for one purpose into doing something entirely different. The novelty lies in the ingenuity of this trick, but it also means the mechanism is not a "natural" or fundamentally optimal way to perform the computation. The novelty is in making the imperfect analogy work, not in creating a perfect solution from first principles.
- Performance is a Consequence, Not an Invention: The reported speedup is a direct and expected consequence of successfully offloading a parallelizable workload to a powerful, dedicated hardware unit. The novelty is in enabling this offload via the new algorithm, not in the discovery that hardware acceleration makes things faster.
Questions to Address In Rebuttal
- Your work proposes a novel mapping of a linear algebra problem onto a graphics pipeline. Can you discuss any prior art in the broader history of GPGPU where a similarly non-obvious, domain-crossing mapping was used to accelerate a non-graphics workload?
- The representation of the matrix as a collection of AABBs is a key enabling technique. Is this a point solution, or do you see this as a new, generalizable method for representing other abstract discrete structures (e.g., graphs, sets) in a geometric form to leverage graphics hardware?
- The core of your novelty is the SpMSpM-to-ray-tracing algorithm. If a future GPU were to include a more general-purpose, programmable "Traversal Engine" instead of a fixed-function RT Core, how much of the novelty of your work would remain? Does the contribution lie primarily in overcoming the limitations of today's hardware?
- What is the most surprising or unexpected challenge you faced when trying to map the precise, discrete world of integer linear algebra onto the continuous, floating-point world of geometric ray tracing?
ANVIL: An In-Storage Accelerator for Name–Value Data Stores
Abstract
Name–value pairs (NVPs) are a widely-used abstraction to organize data in millions of applications. At a high level, an NVP associates a name (e.g., array index, key, hash) with each value in a collection of data. Specific NVP data store formats can vary ...
Reviews
Review 1
Persona 1: The Guardian (Adversarial Skeptic)
Review Form
Summary
This paper proposes ANVIL, an in-storage accelerator designed to speed up queries on Name-Value Pair (NVP) data stores. The system consists of a new API, a format-agnostic hardware accelerator called the Name-Value Processing Unit (NVPU) integrated into an SSD, and a software framework for registering NVP formats. By performing queries directly on the flash storage, ANVIL aims to eliminate the data movement overhead between the SSD and the host CPU, claiming significant performance and energy improvements.
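To fix ideas about where the claimed savings and the disputed overheads sit, the sketch below shows what a host-side offloaded lookup might look like. The API names, descriptor fields, and stubbed bodies are all invented for illustration (the paper's interface is not reproduced); the format-registration and query-submission steps are precisely the device-programming and driver-path costs questioned below.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical host-side ANVIL-style API. All names and fields here are
// invented for illustration; the paper's actual interface is not reproduced.
namespace anvil {

// Bit-level description of one NVP format: where the name and value fields
// live inside each fixed-size record.
struct FormatDescriptor {
    uint32_t record_bits;       // total size of one name-value record
    uint32_t name_offset_bits;  // bit offset of the "name" field
    uint32_t name_width_bits;
    uint32_t value_offset_bits;
    uint32_t value_width_bits;
};

using FormatId = int;

// Registering a format would ship the descriptor to the SSD so the NVPU's
// bit-funneling datapath can be configured. Stubbed here; in a real system
// this device-programming step is an overhead the review asks to quantify.
inline FormatId register_format(const FormatDescriptor&) { return 0; }

struct Query {
    FormatId format;
    uint64_t name;  // point lookup: return the value whose name equals this
};

// Issuing a query crosses user space, the kernel driver, and SSD firmware
// before the NVPU touches flash. Stubbed to return an empty value.
inline std::vector<uint8_t> execute(const Query&, const std::string&) { return {}; }

}  // namespace anvil

// Example usage: offload a lookup into a flat 64-bit-key / 64-bit-value store.
int main() {
    anvil::FormatDescriptor kv64{128, 0, 64, 64, 64};
    anvil::FormatId fmt = anvil::register_format(kv64);       // device-programming cost
    auto value = anvil::execute({fmt, /*name=*/42}, "/data/store.nvp");
    return static_cast<int>(value.size());
}
```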
Strengths
The paper correctly identifies a well-known and important performance bottleneck.
- Valid Problem Identification: The central premise is sound: for many data-intensive applications, the cost of moving data from storage to the host processor over the PCIe bus is a dominant performance and energy bottleneck (Section 1, Page 1). Targeting this bottleneck is a valid research direction.
Weaknesses
The paper's conclusions are built upon a foundation of an inequitable baseline, an oversimplified analysis of system overheads, and unsubstantiated claims of generality and programmability.
- Fundamentally Unsound Baseline Comparison: The headline performance claims (e.g., 269x speedup) are invalid and deeply misleading because the comparison is an apples-to-oranges fallacy. A custom, application-specific hardware accelerator (the NVPU) is compared against a general-purpose CPU running a single-threaded C++ implementation (Section 5.2, Page 11). An ASIC will always outperform a single CPU core on its target task. A rigorous evaluation would compare ANVIL against a highly-optimized, multi-threaded CPU baseline that uses modern libraries (e.g., Intel TBB, optimized hashing libraries) or, even more importantly, against a GPU-based solution, which is the standard for high-throughput data processing. The reported speedups are an artifact of a weak baseline, not a demonstrated architectural superiority.
- Critical Overheads are Ignored and Minimized: The paper fails to properly account for the significant overheads of its own system.
- Programming Overhead: The NVPU must be configured for each new query and NVP format. The paper provides no analysis of the latency or energy cost of programming the hardware, which could be substantial, especially for queries that operate on small amounts of data.
- API and Driver Overhead: The ANVIL API introduces a new software layer. The cost of traversing this stack—from user space, through the OS kernel and driver, to the SSD—is not quantified. This overhead could easily dominate the execution time for low-latency queries.
- "Bit-Funneling" Complexity: The "bit-funneling" mechanism (Section 4.3, Page 9) is presented as a general solution for handling diverse data formats, but its implementation details are sparse. The logic required to parse the format description, dynamically configure the data paths, and extract specific bit fields is highly complex. The area and power cost of this fully-general programmable logic is likely much higher than the paper's estimates, which appear to be based on a simple datapath.
- Claim of Generality is Unproven: The paper claims ANVIL can accelerate "most formats of NVPs" (Abstract, Page 1). However, the evaluation is limited to a few simple, flat data structures like arrays and hash maps (Section 5.1, Page 11). There is no evidence that the architecture can efficiently handle more complex, real-world formats, such as nested structures (e.g., JSON), graph data stores, or formats requiring multi-step, dependent queries. The claim of general applicability is an unsubstantiated leap from a few simple examples.
- Wear-Out and Reliability Concerns: The proposed mechanism encourages many small, random reads from the flash memory. This access pattern is known to be the worst-case scenario for NAND flash endurance and can significantly increase read disturb errors and reduce the lifetime of the SSD. The paper completely ignores this critical reliability issue.
Questions to Address In Rebuttal
- To provide a fair comparison, please evaluate ANVIL against a state-of-the-art, multi-threaded CPU implementation and a GPU-based implementation of the same query workloads.
- Provide a detailed, cycle-accurate performance breakdown of a full query, including the software overhead of the API call, the time to program the NVPU, the query execution time, and the time to return the result.
- To substantiate your claim of generality, please demonstrate ANVIL's effectiveness on a complex, nested NVP format, such as parsing a large JSON file or performing a graph traversal query.
- Please provide an analysis of the impact of ANVIL's access patterns on the endurance and long-term reliability of the underlying NAND flash. What is the expected reduction in the SSD's lifetime compared to a conventional workload?
Review 2
Persona 2: The Synthesizer (Contextual Analyst)
Review Form
Summary
This paper introduces ANVIL, an end-to-end system for accelerating queries on Name-Value Pair (NVP) data stores through in-storage processing. The core contribution is a complete, vertically-integrated solution that includes a new, generalized NVP data abstraction, a high-level API for offloading queries, and a programmable hardware accelerator, the Name-Value Processing Unit (NVPU), embedded within an SSD. By moving the computation to the data, ANVIL aims to circumvent the PCIe bus, which is a major bottleneck in modern data-intensive applications. This work represents a significant step towards creating truly general-purpose, programmable computational storage devices.
Strengths
This paper is a significant and forward-looking contribution that elegantly synthesizes ideas from database theory, computer architecture, and storage systems to create a powerful and practical solution to a fundamental problem.
- A Pragmatic and General Approach to Computational Storage: The most significant contribution of this work is its creation of a general and extensible framework for computational storage. While prior work has focused on offloading specific, fixed functions (like compression or encryption), ANVIL is the first to propose a truly programmable engine built around a broad and fundamental data abstraction—the Name-Value Pair (Section 2, Page 2). This is a massive leap forward. It moves computational storage from a collection of niche point-solutions to a general-purpose platform, which is a critical step for widespread adoption. 🚀
- Elegant Synthesis of the Full System Stack: ANVIL is a beautiful example of a true full-stack co-design. It doesn't just propose a piece of hardware; it presents a complete, end-to-end solution that includes the programmer-facing API, the software driver, the hardware microarchitecture, and the physical storage interface (Figure 1, Page 2). This holistic vision, which considers the problem from the application all the way down to the silicon, is a hallmark of mature and impactful systems research.
- Connecting to the Core of Modern Applications: The work is brilliantly motivated by its focus on NVPs. As the paper correctly identifies, NVPs are a ubiquitous data abstraction, forming the foundation of everything from simple arrays and dictionaries to massive key-value stores and AI feature tables (Section 2.1, Page 3). By targeting this fundamental building block, the ANVIL framework has the potential for incredibly broad impact, accelerating a huge swath of modern data-intensive applications.
Weaknesses
While the core vision is powerful, the paper could be strengthened by broadening its focus to the software ecosystem and the long-term evolution of the storage landscape.
- The "Killer App" and Software Ecosystem: The paper proves the potential of ANVIL with a set of microbenchmarks and specific applications. The next critical step for impact is integration into a major, real-world software system. For example, how could ANVIL be used as a transparent acceleration backend for a popular key-value store like RocksDB or a data analytics framework like Apache Spark? A discussion of the path to integration with a major open-source project would provide a clearer roadmap to real-world impact.
- The Future of Storage Interconnects: ANVIL's primary benefit comes from avoiding the PCIe bus. However, new interconnect technologies like CXL (Compute Express Link) are emerging that promise to create a much more tightly-coupled, lower-latency connection between processors and devices. A discussion of how the value proposition of ANVIL changes in a world with CXL would be a fascinating and forward-looking addition.
- Beyond Simple Queries: The paper focuses on relatively simple "scan-and-filter" style queries. A discussion of how the ANVIL architecture could be extended to support more complex, multi-stage queries (e.g., queries that require joins or aggregations) would be a valuable exploration of the architecture's future potential.
Questions to Address In Rebuttal
- Your work provides a powerful new hardware capability. Looking forward, what do you see as the biggest challenge in integrating ANVIL into a major data-processing framework like Apache Spark to make its benefits transparently available to a broad base of users?
- How does the emergence of CXL, which promises a high-bandwidth, cache-coherent interconnect between the CPU and devices, change the design trade-offs for in-storage processing? Does it reduce the need for a system like ANVIL, or does it create new opportunities? 🤔
- How would you extend the ANVIL hardware and API to support more complex, multi-stage database queries, such as a "scan-filter-aggregate" pipeline, all within the SSD?
- The NVP abstraction is very powerful. What other common data structures, beyond NVPs, do you think are a good target for a similar, generalized in-storage acceleration framework?
Review 3
Persona 3: The Innovator (Novelty Specialist)
Review Form
Summary
This paper introduces ANVIL, an end-to-end system for in-storage acceleration of queries on Name-Value Pair (NVP) data stores. The core novel claims are the synthesis of three components into a single, cohesive framework: 1) A new, generalized NVP Abstraction that formally describes a wide variety of NVP formats in a machine-readable way (Section 3, Page 5). 2) A new, programmable hardware accelerator, the Name-Value Processing Unit (NVPU), designed to be integrated into an SSD controller and to interpret the NVP Abstraction to perform queries (Section 4, Page 7). 3) A new API and software stack that allows a programmer to register NVP formats and offload queries to the NVPU. The creation of this complete, general-purpose, and programmable in-storage processing framework is presented as the primary novel contribution.
Strengths
From a novelty standpoint, this paper is a significant contribution because it proposes a fundamentally new, general-purpose framework for a problem that has previously only been addressed by specific, fixed-function point-solutions.
- A Novel General-Purpose Framework for Computational Storage: The most significant "delta" in this work is its generality. While the concept of computational storage is known, prior art has been dominated by fixed-function accelerators for specific tasks (e.g., compression, encryption, database filtering for a single, predefined schema). ANVIL is the first work to propose a truly programmable and format-agnostic computational storage device. The combination of the NVP Abstraction and the programmable NVPU creates a new, flexible paradigm that has not been explored in prior work. It moves computational storage from a "feature" to a "platform." 🧠
- A Novel Hardware/Software Interface: The NVP Abstraction, which allows a programmer to describe the bit-level layout of their data structure to the hardware, is a novel hardware/software interface. This is a significant departure from traditional fixed-ISA designs. It is a new, data-centric approach to programmability that is a perfect fit for the domain of in-storage processing.
- A Novel Microarchitecture for Data-Intensive Search: The NVPU architecture itself, with its combination of a programmable "bit-funneling" unit and fixed-function comparators, is a novel microarchitectural design point specifically tailored for high-throughput, on-the-fly data parsing and filtering (Section 4.3, Page 9). It is not a general-purpose processor, but a new type of data-flow engine designed for the specific task of searching semi-structured data.
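As a software reference point for what the bit-funneling stage must do per record, the following is a minimal sketch under assumed conventions (little-endian packing, fields up to 64 bits, a flat name/value layout invented for illustration). The extraction itself reduces to a gather, shift, and mask; the architectural question is the cost of making the offsets and widths fully programmable at storage line rate.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative field descriptor: an assumed (offset, width) pair over a
// little-endian packed record, not a layout taken from the paper.
struct Field { uint32_t offset_bits; uint32_t width_bits; };

// Software model of one bit-funneling extraction: gather the bytes covering
// the field, shift, and mask. In hardware this becomes a programmable
// shifter/mask network feeding a fixed-function comparator.
uint64_t extract(const std::vector<uint8_t>& record, Field f) {
    uint64_t word = 0;
    uint32_t first_byte = f.offset_bits / 8;
    for (uint32_t i = 0; i < 8 && first_byte + i < record.size(); ++i)
        word |= static_cast<uint64_t>(record[first_byte + i]) << (8 * i);
    word >>= f.offset_bits % 8;
    uint64_t mask = (f.width_bits >= 64) ? ~0ULL : ((1ULL << f.width_bits) - 1);
    return word & mask;
}

int main() {
    // 16-byte record: 64-bit name followed by a 64-bit value (assumed layout).
    std::vector<uint8_t> record = {42, 0, 0, 0, 0, 0, 0, 0,   // name  = 42
                                   7,  0, 0, 0, 0, 0, 0, 0};  // value = 7
    Field name{0, 64}, value{64, 64};
    if (extract(record, name) == 42)   // comparator step: keep matching records
        std::printf("match, value = %llu\n",
                    static_cast<unsigned long long>(extract(record, value)));
    return 0;
}
```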
Weaknesses
While the overall framework is highly novel, it is important to contextualize its novelty. The work cleverly synthesizes many ideas, but the underlying technologies are adaptations of existing concepts.
- Component Concepts are Inspired by Prior Art: The novelty is primarily in the synthesis and generalization, not in the invention of the base concepts from first principles.
- In-Storage Processing: The core idea is part of the well-established field of computational storage.
- Programmable Data Parsers: The NVPU's "bit-funneling" unit is conceptually similar to the programmable packet-parsing engines found in modern network interface cards (NICs) and network switches. The novelty is the application of this concept to the storage domain.
- API Offloading: The model of using a driver and API to offload tasks to a hardware accelerator is the standard model for all hardware acceleration (e.g., CUDA, DPDK).
- The "First" Claim is Specific: The claim to be the "first" general-purpose framework for NVPs is a strong one, but it is specific. It does not invent the idea of in-storage processing, but it is the first to propose a credible path to making it truly flexible and widely applicable. The novelty is in the leap from fixed-function to programmable.
Questions to Address In Rebuttal
- The core of your novelty is the general-purpose, programmable nature of ANVIL. How does your NVP Abstraction and NVPU design differ fundamentally from the programmable parsing and filtering engines used in modern SmartNICs (e.g., Mellanox BlueField, Intel IPUs)? What is the key "delta" that makes your work a novel contribution in this context?
- The "bit-funneling" concept is central to the NVPU. Can you discuss any prior art in other domains (e.g., compilers, network protocol processing) that has used a similar, descriptor-driven approach to data extraction and parsing in hardware?
- If a competitor were to propose an alternative in-storage solution based on a small, general-purpose RISC-V core instead of your specialized NVPU, what is the fundamental, enduring novelty of the ANVIL microarchitecture that would make it superior?
- What is the most non-obvious or surprising data format that your generalized NVP Abstraction can describe, which would have been difficult or impossible to accelerate with prior, fixed-function computational storage devices?
UPP: Universal Predicate Pushdown to Smart Storage
Abstract
In large-scale analytics, in-storage processing (ISP) can significantly boost query performance by letting ISP engines (e.g., FPGAs) pre-select only the relevant data before sending them to databases. This reduces the amount of not only data transfer ...
Reviews
Review 1
Persona 1: The Guardian (Adversarial Skeptic)
Review Form
Summary
This paper introduces Universal Predicate Pushdown (UPP), a framework for offloading database filter operations to an in-storage FPGA. The core idea is to compile user-defined functions (UDFs) and filter predicates, written in high-level languages like C++, into WebAssembly (WASM). The WASM module is then translated into a hardware description (Verilog), synthesized into a bitstream, and programmed onto the FPGA inside a "smart" storage device. By compiling arbitrary predicates to hardware, the authors claim to provide a general-purpose in-storage processing solution, reporting significant speedups over a CPU-based baseline.
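For context on the pipeline under discussion, the snippet below shows the kind of source-level predicate UPP targets, followed by a schematic outline of the toolchain steps. The column layout and constants are illustrative (a TPC-H Q6-style filter), and the toolchain comments are generic descriptions rather than the paper's actual tool names or flags.

```cpp
#include <cstdint>

// A TPC-H-style filter predicate written as a plain C++ UDF. The column
// layout is illustrative; in the UPP flow this function would be compiled to
// WASM and then translated into a Verilog datapath that streams records.
struct LineItemRow {
    int32_t l_quantity;
    int32_t l_shipdate;   // days since epoch, assumed encoding
    float   l_discount;
};

// Returns true if the row should be forwarded to the host database.
extern "C" bool predicate(const LineItemRow* row) {
    return row->l_shipdate >= 8766 &&        // >= 1994-01-01 (day 8766)
           row->l_shipdate <  9131 &&        // <  1995-01-01 (day 9131)
           row->l_discount >= 0.05f &&
           row->l_discount <= 0.07f &&
           row->l_quantity <  24;
}

// Conceptual toolchain (schematic, not the paper's exact commands):
//   1. C++  -> WASM      : a standard WASM-targeting compiler (e.g., clang/LLVM)
//   2. WASM -> Verilog   : UPP's translator maps the stateless predicate to a
//                          pipelined comparator datapath
//   3. Verilog -> bitstream : vendor FPGA synthesis and place-and-route, which
//                          is the minutes-to-hours reconfiguration cost the
//                          review argues must be charged to each ad-hoc query
```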
Strengths
The paper correctly identifies a well-known and important performance bottleneck.
- Valid Problem Identification: The central premise is sound: for data-intensive analytics, the cost of moving data from storage to the host CPU is a first-order bottleneck (Section 1, Page 1). The goal of pushing computation (in this case, predicate evaluation) to the data is a valid and important research direction.
Weaknesses
The paper's conclusions are built upon a fundamentally flawed and inequitable baseline, a questionable compilation pipeline, and a complete failure to address the most critical challenges of real-world in-storage processing.
- Fundamentally Unsound Baseline Comparison: The headline performance claims are invalid and grossly misleading. The paper compares a custom, application-specific FPGA accelerator against a general-purpose CPU running a database (e.g., MonetDB) (Figure 6, Page 9). An FPGA will almost always outperform a CPU on a highly-parallelizable, data-streaming task like filtering. This is a classic apples-to-oranges fallacy. A rigorous and fair comparison would require evaluating UPP against a state-of-the-art, multi-threaded CPU implementation that uses SIMD instructions, or against a GPU-based filtering solution. The reported speedups are an artifact of an unfair comparison, not a demonstrated architectural superiority.
- Compilation and Reconfiguration Overheads are Ignored: The entire UPP flow depends on a complex, multi-stage compilation process: C++ to WASM, WASM to Verilog, and finally, synthesis and place-and-route to generate an FPGA bitstream (Figure 2, Page 4). This end-to-end compilation, especially the hardware synthesis step, is an extremely slow process that can take minutes or even hours for non-trivial predicates. The paper completely ignores this overhead in its performance analysis. The claim of supporting "ad-hoc queries" is absurd when the reconfiguration time for each new query predicate would be orders of magnitude larger than the data transfer time it is trying to save.
- "Universality" is a Gross Overstatement: The paper claims to support "universal" predicate pushdown, but the evaluation is limited to a handful of simple filter predicates from the TPC-H benchmark (Table 1, Page 8). There is no evidence that the framework can handle the full complexity of modern SQL, including complex UDFs with state, non-trivial control flow, or dependencies on external libraries. The WASM-to-Verilog translator is likely to be extremely brittle and support only a small, simple subset of the WASM specification. The claim of universality is unsubstantiated.
- Practical System-Level Issues are Ignored: The paper provides a cartoonish, high-level view of a smart storage device and ignores numerous real-world engineering challenges. How is the FPGA managed, programmed, and secured? How does the UPP framework handle complex data types, variable-length strings, and different file formats (e.g., Parquet, ORC)? How are errors and exceptions from the hardware-accelerated UDF propagated back to the host database? The work presents a toy system, not a practical one.
Questions to Address In Rebuttal
- To provide a fair comparison, please evaluate UPP against a state-of-the-art, multi-threaded, SIMD-optimized CPU implementation and a GPU-based implementation of the same TPC-H filter predicates.
- Please report the end-to-end time for a new, ad-hoc query, including the full C++-to-WASM-to-bitstream compilation and FPGA reconfiguration time. How does this "query compilation time" compare to the query execution time?
- To substantiate your claim of "universality," please demonstrate that your framework can successfully compile and accelerate a complex, stateful User-Defined Function from a real-world analytics workload, not just a simple, stateless TPC-H predicate.
- Please provide a detailed description of the runtime system required to manage the UPP FPGA. How does the host OS discover, program, and manage the lifecycle of the hardware-accelerated predicates, and how are runtime errors handled?
Review 2
Persona 2: The Synthesizer (Contextual Analyst)
Review Form
Summary
This paper introduces Universal Predicate Pushdown (UPP), a novel, end-to-end framework for enabling general-purpose, in-storage processing. The core contribution is a complete, vertically-integrated toolchain that allows developers to write arbitrary filter predicates in a high-level language (like C++), compiles them to a universal and portable intermediate representation (WebAssembly), and then automatically synthesizes a custom hardware accelerator for that predicate on an FPGA embedded in a smart storage device. By providing a seamless path from a high-level software abstraction to a low-level hardware implementation, UPP aims to make computational storage truly programmable and accessible, breaking the bottleneck of the host-storage interconnect for a wide range of database and analytics workloads.
Strengths
This paper is a significant and forward-looking contribution that elegantly synthesizes ideas from database systems, compiler theory, and reconfigurable hardware to create a powerful and practical vision for the future of computational storage.
- A Pragmatic and General Approach to Programmable Storage: The most significant contribution of this work is its creation of a complete and plausible end-to-end toolchain for programmable storage. While the idea of computational storage has been around for decades, its adoption has been crippled by the lack of a viable programming model. UPP provides a brilliant solution to this problem by leveraging WebAssembly (WASM) as a universal, hardware-agnostic intermediate representation (Section 3, Page 4). This is a massive conceptual leap forward. It connects the world of smart storage to the vast and mature ecosystem of modern software compilers (e.g., LLVM), which is a critical step for making computational storage truly usable by everyday developers. 🚀
- Elegant Synthesis of the Full System Stack: UPP is a beautiful example of a true full-stack co-design. It doesn't just propose a piece of hardware; it presents a complete, coherent vision that includes the programmer-facing language, the compiler and IR, the hardware synthesis flow, and the runtime system. This holistic approach, which considers the problem from the application all the way down to the FPGA bitstream, is a hallmark of mature and impactful systems research.
- Enabling the Future of Data-Centric Computing: The work is brilliantly motivated by the "data gravity" problem—the fact that it is becoming easier to move compute to the data than to move massive datasets to the compute. UPP provides a concrete and compelling architectural blueprint for this data-centric future. By making storage devices active, programmable participants in the computation, UPP helps to blur the lines between storage and compute, which is a key trend in the evolution of modern data centers and high-performance computing systems.
Weaknesses
While the core vision is powerful, the paper could be strengthened by broadening its focus to the long-term challenges of deploying and managing such a flexible system.
- The Reconfiguration Bottleneck: The paper focuses on the performance of the final, accelerated query but spends less time discussing the significant latency of the FPGA compilation and reconfiguration process. For a truly interactive, ad-hoc query environment, this "time-to-first-result" is critical. A discussion of how this compilation overhead could be mitigated—perhaps through a library of pre-compiled common functions or a JIT-like compilation flow—would provide a more complete picture of a production-ready system.
- Security and Resource Management: A fully programmable storage device introduces significant new security and resource management challenges. How do you prevent a malicious or buggy user-defined predicate from compromising the integrity of the storage device? How do you manage and schedule the limited FPGA resources in a multi-tenant environment where many different users are trying to offload predicates? A discussion of the required OS and runtime support for managing this new class of device would be a valuable addition.
- Beyond Filtering: The paper focuses on the canonical computational storage workload: predicate pushdown (filtering). However, a programmable engine like UPP could potentially accelerate other important in-storage tasks, such as data transformation, compression, or even simple aggregations. An exploration of how the UPP framework could be extended to support a broader class of in-situ data processing would be a fascinating direction for future work.
Questions to Address In Rebuttal
- Your use of WASM as a hardware-agnostic IR is a brilliant idea. Looking forward, how do you see this approach evolving? Could a future version of UPP use a JIT (Just-In-Time) compilation flow to dramatically reduce the reconfiguration latency for new queries?
- A programmable storage device is a powerful tool, but also a potential security risk. What new hardware and software mechanisms do you think are needed to create a secure, multi-tenant execution environment on the UPP FPGA? 🤔
- How could the UPP framework be extended to support more complex, "near-data" computations beyond simple filtering, such as data aggregation or format transcoding, all within the storage device?
- What do you see as the biggest non-technical barrier to the widespread adoption of a programmable storage framework like UPP? Is it the need for new programming models, the lack of standardized APIs, or something else entirely?
Review 3
Persona 3: The Innovator (Novelty Specialist)
Summary
This paper introduces Universal Predicate Pushdown (UPP), an end-to-end framework for in-storage processing. The core novel claim is the synthesis of a complete, language-agnostic toolchain that automatically translates high-level filter predicates into custom hardware accelerators on a smart storage device. This is achieved through a novel two-stage compilation process: 1) The use of WebAssembly (WASM) as a universal, hardware-agnostic intermediate representation (IR) for the predicate logic (Section 3, Page 4), a first for the computational storage domain. 2) A new WASM-to-Verilog compiler that automatically synthesizes a hardware datapath from the WASM representation (Section 4, Page 6). The creation of this seamless, fully-automated software-to-hardware pipeline for a storage device is presented as the primary novel contribution.
Strengths
From a novelty standpoint, this paper is a significant contribution because it proposes a fundamentally new, practical, and general-purpose programming model for a problem that has previously been addressed only with specific, fixed-function, or difficult-to-program solutions.
- A Novel Programming Abstraction for Computational Storage: The most significant "delta" in this work is its use of WebAssembly as the hardware/software interface. While computational storage is a known concept, prior work has required developers to use low-level, non-portable hardware description languages (like Verilog or HLS C++) to program the device. UPP is the first work to propose a high-level, language-agnostic, and portable programming model for a smart storage device. This is a massive leap forward in usability and is a truly novel approach that could finally make programmable storage practical. 🧠
- A New Compiler-for-Hardware-Synthesis Flow: The creation of a compiler that directly translates a software IR (WASM) into a hardware description (Verilog) for this domain is a novel and significant engineering contribution. While HLS tools exist, they are complex and require specialized programming expertise. The UPP compiler, which automatically generates a datapath from a restricted but universal software format, represents a new and more accessible path from software to custom hardware.
- A New Level of Generality: The synthesis of these ideas creates a new level of generality. Prior work focused on accelerating specific, hard-coded predicates. UPP is the first framework to propose a credible path to accelerating arbitrary, user-defined predicates, thanks to its novel compilation flow. The novelty is in the leap from a fixed-function device to a truly universal and programmable one.
Weaknesses
While the overall framework is highly novel, it is important to contextualize its novelty. The work cleverly synthesizes many ideas from the compiler and hardware communities, but the underlying technologies are adaptations of existing concepts.
- Component Concepts are Inspired by Prior Art: The novelty is primarily in the synthesis and the application to a new domain.
- WASM: WebAssembly is a well-established, standardized IR from the software world. The novelty is being the first to recognize its potential as an IR for hardware synthesis in the storage domain.
- High-Level Synthesis (HLS): The WASM-to-Verilog compiler is, in essence, a form of HLS. The novelty is in the choice of the source language (WASM) and the specific, data-flow architecture it targets, not in the general idea of compiling a high-level language to hardware.
- FPGA-based Accelerators: Using FPGAs as reconfigurable accelerators in storage devices is a known concept.
- The "First" Claim is Specific: The claim to be the "first" universal framework is a strong one, but it is specific. It does not invent the idea of in-storage processing or HLS, but it is the first to combine them in this specific, powerful, and accessible way.
Questions to Address In Rebuttal
- The core of your novelty is the use of WASM as an IR for hardware synthesis. Can you contrast your approach with prior work in the HLS community that has used other software-based IRs (like LLVM-IR) as a starting point for hardware generation? What is the key "delta" that makes WASM a fundamentally better or more novel choice for this specific domain?
- The WASM-to-Verilog compiler is a key enabling technology. What is the most non-obvious or surprising challenge you faced when trying to map the semantics of a software-oriented ISA like WASM onto the parallel, spatial semantics of a hardware datapath?
- If a competitor were to propose an alternative in-storage solution based on embedding a small, general-purpose RISC-V core that could interpret the WASM code directly (a "JIT-to-FPGA" vs. a "JIT-to-CPU" approach), what is the fundamental, enduring novelty of your ahead-of-time, full-synthesis approach that would make it superior?
- What is the most complex or unexpected software predicate that your novel framework can compile and accelerate, which would have been difficult or impossible to implement using prior, more rigid HLS-based or fixed-function approaches to computational storage?
In-Storage Acceleration of Retrieval Augmented Generation as a Service
Abstract
Retrieval-augmented generation (RAG) services are rapidly gaining adoption in enterprise settings as they combine information retrieval systems (e.g., databases) with large language models (LLMs) to enhance response generation and reduce hallucinations. ...
Reviews
Review 1
Persona 1: The Guardian (Adversarial Skeptic)
Summary
This paper proposes RAGX, an in-storage accelerator for Retrieval-Augmented Generation (RAG). The authors identify the Search & Retrieval phase of RAG as a key bottleneck when the knowledge base is stored on SSDs. To address this, RAGX offloads the vector search (specifically, the HNSW graph traversal) and document scoring/retrieval to a custom accelerator integrated into an SSD. The system is designed for a multi-tenant "as-a-Service" environment, featuring a custom scheduler and memory manager. The authors claim this approach provides significant end-to-end latency reduction and improves system throughput.
Strengths
The paper correctly identifies a clear and relevant performance bottleneck.
- Valid Problem Identification: The central premise is sound: as the knowledge bases for RAG systems grow too large for DRAM, the latency of retrieving context from persistent storage (SSDs) will become a first-order performance bottleneck (Section 2, Page 2). Targeting this bottleneck is a valid and important research direction.
Weaknesses
The paper's conclusions are fundamentally undermined by a flawed and inequitable baseline, a failure to address critical system overheads, and an oversimplified representation of the problem.
- Fundamentally Unsound Baseline Comparison: The headline performance claims are invalid because the comparison is an apples-to-oranges fallacy. A custom, application-specific ASIC (RAGX) is being compared to a general-purpose CPU and DRAM system (Section 5, Page 7). An ASIC will always be more power- and performance-efficient for its target workload. A rigorous and fair comparison would require evaluating RAGX against a state-of-the-art CPU baseline that is highly optimized for vector search (e.g., using SIMD, advanced memory prefetching, and multiple threads) or against another specialized vector accelerator. The reported speedups are an artifact of specialization, not a demonstrated architectural superiority over a comparable solution.
- Critical Overheads are Ignored: The paper's analysis focuses on the core graph traversal and scoring but appears to completely ignore or minimize several critical overheads.
- Programming and Scheduling Overhead: In a multi-tenant environment, the RAGX hardware must be constantly reconfigured and scheduled to handle requests from different users with different models and databases. The paper describes its scheduler only at a high level (Section 4.3, Page 6), and there is no analysis of the latency or energy cost of this dynamic reconfiguration and context switching, which could be substantial.
- Off-chip Communication within the SSD: The RAGX accelerator needs to communicate with the SSD's Flash Translation Layer (FTL) and the NAND flash chips themselves. This internal SSD network is a shared resource and a potential bottleneck. The paper provides no analysis of this internal communication overhead.
- Graph Traversal is an Unsuitable Task for In-Storage Processing: The core task being accelerated is HNSW graph traversal. This is a memory-latency-bound problem with a highly irregular, pointer-chasing access pattern. This is precisely the worst-case workload for NAND flash, which is optimized for large, sequential block access. Offloading a latency-sensitive, random-access workload to a high-latency storage device is a fundamental architectural mismatch. The paper fails to provide a convincing argument for why this is a sensible design choice; a sketch of the access pattern in question follows this list.
- "As-a-Service" Claims are Unsubstantiated: The paper claims to be designed for a multi-tenant "as-a-Service" environment, but the evaluation uses a simple, homogeneous workload of concurrent queries to the same database (Section 5.2, Page 8). There is no analysis of a more realistic, heterogeneous workload with different vector models, different databases, and a mix of high- and low-priority queries. The claim of being a robust multi-tenant solution is not proven.
Questions to Address In Rebuttal
- To provide a fair comparison, how does RAGX perform against a state-of-the-art, multi-threaded CPU baseline that uses a highly optimized vector search library such as FAISS with full SIMD exploitation? (A minimal example of such a baseline is sketched after this list.)
- Please provide a detailed analysis of the overhead of your multi-tenant scheduler. What is the end-to-end latency for a single, small query in a heavily-loaded system, including all scheduling and hardware reconfiguration delays?
- Can you justify the fundamental decision to offload a latency-sensitive, random-access graph traversal workload to a high-latency NAND flash-based device? Please provide a detailed breakdown of the internal SSD access latencies and show that they do not dominate the end-to-end execution time.
- To substantiate your "as-a-Service" claims, please evaluate your system under a heterogeneous workload that includes multiple, concurrent queries using different embedding models and targeting different vector databases stored on the same device.
Review 2
Persona 2: The Synthesizer (Contextual Analyst)
Review Form
Summary
This paper introduces RAGX, a complete, vertically-integrated system for accelerating Retrieval-Augmented Generation (RAG) through in-storage processing. The core contribution is a holistic, full-stack design that moves the computationally intensive "Search & Retrieval" phase of the RAG pipeline from the host CPU directly into a smart storage device. This is achieved through a co-design of a specialized hardware accelerator for vector search, an intelligent multi-tenant scheduler, and an optimized memory and data layout system. By co-locating the vector search computation with the massive knowledge bases stored on SSDs, RAGX aims to eliminate the host-storage communication bottleneck, enabling a new level of performance, efficiency, and scale for RAG-as-a-Service applications.
Strengths
This paper is a significant and timely contribution that sits at the cutting edge of AI, systems, and storage. Its strength lies in its deep, full-stack understanding of an important and emerging workload and its creation of a complete, end-to-end solution.
- A Brilliant Example of Full-Stack, Workload-Driven Design: The most significant contribution of this work is its textbook execution of full-stack, workload-driven design. The authors have not just accelerated a single kernel; they have analyzed a complete, real-world application (RAG-as-a-Service), identified its key bottleneck, and designed a comprehensive hardware and software solution to address it (Section 2, Page 2). The tight integration of the hardware accelerator, the scheduler, and the memory system (Section 4, Page 5) is a hallmark of a mature and well-considered system design. 🚀
- Enabling the Future of Enterprise AI: The practical impact of this work could be immense. RAG is rapidly becoming the dominant paradigm for deploying Large Language Models in the enterprise, as it allows them to be safely and securely augmented with proprietary, domain-specific knowledge. However, the cost and latency of performing vector search over massive, terabyte-scale knowledge bases is a major barrier to adoption. By providing a solution that is an order of magnitude more performant and efficient (Figure 5, Page 8), RAGX could be a key enabler for the widespread, cost-effective deployment of enterprise RAG.
- Connecting Storage and AI at a Fundamental Level: This work is a pioneering example of the deep, synergistic integration of storage and AI. It moves beyond the simple model of storage as a passive repository of data and re-imagines it as an active, intelligent component of the AI pipeline. This concept of a "smart knowledge base" that can perform its own search and retrieval is a powerful and important architectural evolution that aligns perfectly with the broader trend of data-centric computing.
Weaknesses
While the core vision is powerful, the paper could be strengthened by broadening its focus to the programmability of the system and its interaction with the rapidly evolving AI landscape.
- The Programmability Challenge: The RAGX accelerator is highly specialized for the HNSW vector search algorithm. A key challenge, which is not fully explored, is how the architecture would be adapted to support other, emerging vector search algorithms (e.g., those based on different graph structures or quantization techniques). A discussion of the programmability of the accelerator and the toolchain required to map new algorithms onto it would be a valuable addition.
- Beyond Vector Search: The paper focuses on accelerating the vector search component of RAG. However, a full RAG pipeline involves other important steps, such as document decompression, parsing, and chunking. A discussion of how the RAGX architecture could be extended into a more general-purpose "in-storage RAG pipeline," capable of accelerating these other tasks as well, would be a fascinating direction for future work.
- The Pace of AI Model Research: The paper evaluates its system with a set of current embedding models. However, the field of representation learning is evolving at a breathtaking pace. A discussion of how the RAGX system would need to adapt to a future where embedding models become much larger, or where the nature of the "retrieval" task itself changes (e.g., retrieving structured data or code instead of just text), would be valuable.
Questions to Address In Rebuttal
- Your work is a fantastic example of co-design for a specific algorithm. Looking forward, how would you make the RAGX accelerator more programmable to support future, as-yet-unknown vector search algorithms?
- How do you see the RAGX concept evolving from a vector search accelerator into a more complete, in-storage "RAG pipeline" that could also offload tasks like document parsing and chunking?
- The RAG paradigm is currently dominated by dense vector retrieval. How would your architecture need to change to efficiently support a future where retrieval is based on different modalities (e.g., sparse vectors, images, or even small neural networks)? 🤔
- This work pushes a key part of the AI stack into the storage device. What do you think is the next major component of the modern data center stack that is ripe for a similar, full-stack, in-storage acceleration approach?
Review 3
Persona 3: The Innovator (Novelty Specialist)
Summary
This paper introduces RAGX, a new, in-storage acceleration system for Retrieval-Augmented Generation (RAG). The core novel claim is the holistic, full-stack co-design of a hardware/software system specifically for the "Search & Retrieval" phase of RAG-as-a-Service. The novel components are: 1) The RAGX accelerator, a new, domain-specific hardware architecture designed to accelerate HNSW graph traversal and document scoring directly within an SSD (Section 4, Page 5). 2) A multi-tenant scheduling and memory management system designed to support concurrent queries in a service environment (Section 4.3, Page 6). 3) The end-to-end synthesis of these components into the first published in-storage accelerator for the RAG workload.
Strengths
From a novelty standpoint, this paper is a significant contribution because it proposes a complete, new system architecture for a modern and important emerging workload. It does not just optimize a small piece of the problem; it presents a novel, end-to-end solution.
- A Novel System-Level Architecture for a New Workload: The most significant "delta" in this work is that it is the first paper to identify, characterize, and design a specialized hardware accelerator for the RAG workload. While in-storage processing and vector acceleration are known concepts in isolation, this work is the first to synthesize them into a complete, cohesive system explicitly designed for the unique demands of RAG-as-a-Service. The RAGX architecture, with its tight coupling of the HNSW traversal engine, the scoring unit, and the multi-tenant scheduler, is a fundamentally new architectural design point. 🧠
- A Novel In-Storage Graph Traversal Engine: The core of the RAGX accelerator is a hardware engine for traversing the HNSW graph. While graph accelerators are known, the RAGX engine is a new and specific design that is highly specialized for the memory access patterns and computational needs of HNSW (Section 4.1, Page 5). This is not a general-purpose graph processor; it is a novel, domain-specific traversal engine, and its design is a key contribution.
- A Novel Approach to Multi-Tenancy in Computational Storage: The paper's focus on a "as-a-Service" model and its inclusion of a hardware-level multi-tenant scheduler is a novel and important contribution to the field of computational storage. Prior work has largely focused on single-user, single-application scenarios. The RAGX scheduler, which manages concurrent contexts and prioritizes requests, is a new and necessary component for making in-storage processing practical in a real-world, shared environment (Section 4.3, Page 6).
Weaknesses
While the overall system is highly novel, it is important to contextualize its novelty. The work cleverly synthesizes many ideas from different domains, but the underlying technologies are adaptations of existing concepts.
- Component Concepts are Inspired by Prior Art: The novelty is primarily in the synthesis and the application to a new domain, not in the invention of the base concepts from first principles.
- In-Storage Processing: The core idea is part of the well-established field of computational storage.
- Vector Search Acceleration: The use of hardware to accelerate Approximate Nearest Neighbor (ANN) search is a known area of research.
- Graph Traversal: The core operation is a graph traversal, and specialized hardware for graph processing is an active research field.
- The "First" Claim is Specific: The claim to be the "first" in-storage accelerator for RAG is a strong one, but it is specific to this new and emerging workload. The novelty lies in being the first to identify RAG as a key driver for computational storage and to design a complete system for it.
Questions to Address In Rebuttal
- The core of your novelty is the full-stack design for the RAG workload. Can you contrast your approach with prior work on general-purpose, in-storage graph accelerators? What is the key "delta" in your architecture that makes it uniquely suited for HNSW traversal, a feature that a general-purpose graph engine would lack?
- The multi-tenant scheduler is a key component for the "as-a-Service" model. How is this scheduler fundamentally different from the QoS and scheduling mechanisms found in modern, high-end enterprise SSD controllers? What novel capabilities does your scheduler have that are specific to the RAG workload?
- If a new, superior ANN algorithm were to replace HNSW in the future, which part of the RAGX system's novelty would be more enduring: the specific design of the HNSW traversal engine, or the more general, full-stack methodology of co-designing an in-storage accelerator for a specific service-oriented workload?
- What is the most non-obvious or surprising architectural trade-off you had to make when designing a system that is optimized for both the irregular, latency-sensitive graph traversal of HNSW and the regular, throughput-oriented streaming of the final document retrieval?
Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
Abstract
Modern Large Language Model (LLM) serving systems batch multiple requests to achieve high throughput, while batching attention operations is challenging, rendering memory bandwidth a critical bottleneck. Today, to mitigate this issue, the community ...
Reviews
Review 1
Review Form: The Guardian
Summary
The authors present Oaken, a solution for accelerating LLM inference serving by quantizing the Key-Value (KV) cache. The core contribution is a hybrid online-offline quantization algorithm. This approach involves profiling KV cache distributions offline to establish static outlier thresholds. These thresholds are then used online to partition KV values into three groups (inner, middle, outer). The authors propose a "group-shift" technique to narrow the dynamic range of outlier groups for better low-bit quantization and a "fused dense-and-sparse" encoding scheme to reduce the storage overhead of sparse outliers. The algorithm is implemented with custom hardware modules intended for integration into LLM accelerators. The evaluation, performed using a custom simulator, claims significant throughput improvements over baselines like NVIDIA A100 with vLLM and other quantization methods, with minimal accuracy degradation.
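To make the algorithmic structure under review concrete, the following is a minimal NumPy sketch of the offline-threshold / online-scale idea as this reviewer understands it; the group fractions, threshold choice, shift rule, and 4-bit target are illustrative simplifications rather than the paper's exact formulation.

```python
import numpy as np

def profile_thresholds(calib_kv, outer_frac=0.04, inner_frac=0.06):
    # Offline: pick static magnitude thresholds from calibration KV tensors so
    # that roughly outer_frac of values fall in the outer group and inner_frac
    # in the inner group (the 4%/90%/6% split the paper reports).
    mags = np.abs(np.concatenate([t.ravel() for t in calib_kv]))
    t_outer = np.quantile(mags, 1.0 - outer_frac)
    t_inner = np.quantile(mags, inner_frac)
    return t_inner, t_outer

def quantize_token_kv(kv, t_inner, t_outer, bits=4):
    # Online: partition one token's KV values by the offline thresholds, shift
    # the middle/outer groups toward zero (group-shift), then derive per-group
    # scales from the current token only; no sorting or top-k is required.
    mag = np.abs(kv)
    groups = {
        "inner":  (kv[mag <= t_inner], 0.0),
        "middle": (kv[(mag > t_inner) & (mag <= t_outer)], t_inner),
        "outer":  (kv[mag > t_outer], t_outer),
    }
    qmax = 2 ** (bits - 1) - 1
    encoded = {}
    for name, (vals, shift) in groups.items():
        if vals.size == 0:
            continue
        shifted = vals - np.sign(vals) * shift                 # narrow the dynamic range
        scale = max(float(np.max(np.abs(shifted))) / qmax, 1e-8)
        encoded[name] = (np.round(shifted / scale).astype(np.int8), scale, shift)
    return encoded
```

A sketch at this level also makes the central robustness question explicit: everything downstream depends on t_inner and t_outer, fixed offline, remaining valid for inputs unlike the profiling set.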
Strengths
- Algorithmic Novelty: The concept of combining offline profiling with online quantization to reduce runtime overhead is sound. The proposed group-shift and fused dense-sparse encoding techniques are clever algorithmic contributions aimed at maximizing the compression ratio while managing quantization loss.
- Problem Relevance: The paper correctly identifies the KV cache as a primary bottleneck in batched LLM inference, a critical and well-recognized problem in the field. Addressing both memory capacity and bandwidth constraints is the correct focus.
- Comprehensive Hardware Design: The authors go beyond a purely algorithmic proposal by detailing a hardware implementation (Section 5), including quantization/dequantization engines and a specialized MMU. This demonstrates a thorough, system-level consideration of the problem.
Weaknesses
My review identifies several critical weaknesses that challenge the validity and robustness of the paper's central claims.
-
Unsupported Foundational Assumption: The entire premise of the "offline" profiling strategy rests on the assumption that KV cache value distributions are largely independent of the input data distribution (Observation 2, Section 4.1, page 6). The authors provide evidence for this using four NLP datasets (Wikitext2, PIQA, Hellaswag, Winogrande) which are stylistically similar. This is insufficient proof. The methodology is critically flawed if these static thresholds do not generalize to out-of-distribution prompts, such as those involving code generation, structured data, or multilingual text. The paper presents no evidence to support this crucial generalization, rendering the "offline" component potentially brittle.
-
Misleading and Inconsistent Accuracy Claims: The abstract claims an impressively "minimal accuracy loss of only 0.54% on average, compared to state-of-the-art KV cache quantization techniques." However, the data in Table 2 (page 11) does not substantiate this claim clearly and contradicts it in some cases.
- Comparison to FP16 Baseline: The actual accuracy loss compared to the FP16 Original baseline is often significantly higher. For example, on Llama2-7B with Hellaswag, the accuracy drops from 75.98% to 73.72%, a loss of 2.26%. On Winogrande, the drop is from 69.13% to 67.64%, a loss of 1.49%. It is unclear how the 0.54% figure was calculated; it seems to selectively ignore the worst-case results.
- Comparison to SOTA: The claim of superiority over other quantization methods is also inconsistent. For Llama2-7B on Wikitext2, Oaken's perplexity (5.53) is worse than KVQuant's (5.49). On PIQA, Oaken's accuracy (78.29%) is lower than KVQuant's (78.35%). While Oaken may perform better on other metrics, the claim of consistent superiority is an overstatement not fully supported by the authors' own data.
-
Use of Unjustified Global Hyperparameters: The authors fix the quantization group ratios to 4% outer, 90% middle, and 6% inner for all experiments across all models and datasets (Section 6.1, page 10). The justification provided in Figure 12(a) (page 12) is based solely on a single model (Llama2-7B) on a single dataset (Wikitext2). This is a classic case of over-fitting a hyperparameter to a narrow experimental setup and then generalizing without evidence. There is no reason to believe this ratio is optimal, or even good, for other models with different architectures (e.g., Mixtral-8x7B) or sizes. This undermines the robustness of the reported results.
-
Inequitable Baseline Comparisons: The performance evaluation methodology appears biased. In Section 6.1, the authors state they "disable" weight and activation quantization features for baselines like QServe and Tender for a "fair comparison." This is not a fair comparison. These are integrated systems, and their performance may rely on the interplay of all quantization features. By selectively disabling core functionalities, the authors are not comparing Oaken to the true state-of-the-art but to a crippled version of it. A proper evaluation must compare against these systems in their fully-enabled, intended configurations.
-
Conflation of Algorithmic and System-Level Gains: The paper presents Oaken-LPDDR as the top-performing configuration for large-scale serving. However, a significant portion of its ability to handle larger batches comes from the system-level choice of high-capacity LPDDR memory, not solely from the quantization algorithm. The paper needs to more rigorously disentangle the performance gains directly attributable to the compression from the Oaken algorithm versus the gains from the underlying memory hardware choice. The algorithm enables the use of LPDDR, but it does not, by itself, grant the capacity.
Questions to Address In Rebuttal
- Please provide a precise, step-by-step calculation demonstrating how the "0.54% average accuracy loss" figure was derived from the results in Table 2. Explicitly state which baseline was used for this comparison and justify this choice.
- The core assumption of data-independent KV cache distribution is foundational. Please provide experimental evidence showing that the offline-profiled thresholds are robust for out-of-distribution tasks, such as code generation or mathematical reasoning.
- Please justify the use of a single, global 4%/90%/6% group ratio across all models. Provide a sensitivity analysis for this hyperparameter on at least one other model, such as Mixtral-8x7B or OPT-30B, to demonstrate that the chosen configuration is not simply cherry-picked.
- Can the authors justify their decision to disable core features of the baseline systems (QServe, Tender)? Please either provide results comparing Oaken to these systems in their fully-optimized, published configurations or provide a stronger argument for why the current comparison is fair and representative.
- Regarding the results in Figure 13 (page 12), how much of Oaken-LPDDR's ability to handle longer sequences (16K+) is due to the capacity of LPDDR versus the compression from the Oaken algorithm? Can you quantify the contribution of each?
Review 2
Review Form: The Synthesizer
Summary
This paper introduces Oaken, a hardware-software co-designed solution to address the critical key-value (KV) cache bottleneck in large language model (LLM) inference serving. The core contribution is a novel online-offline hybrid quantization algorithm. This approach performs a one-time, offline profiling of a given model to determine robust thresholds for identifying outlier values in the KV cache. These pre-computed thresholds are then used during live inference to perform a lightweight, dynamic online quantization, separating values into inlier and outlier groups without the need for expensive online sorting or analysis.
To translate this algorithmic gain into performance, Oaken proposes custom hardware modules—including quantization/dequantization engines and a specialized memory management unit (MMU)—designed to be integrated into an LLM accelerator. The authors implement and evaluate this system, demonstrating significant throughput improvements (up to 1.58× over state-of-the-art methods on an A100 GPU) with minimal accuracy degradation. The work situates itself as a practical and effective solution to the conflicting demands for high memory bandwidth and high memory capacity in modern LLM serving.
Strengths
-
Elegant Core Idea: The central concept of an online-offline hybrid approach is a particularly insightful contribution to the field of KV cache quantization. The current landscape is largely divided between:
- Fully Online Methods (e.g., KVQuant [22], KIVI [43]), which offer high accuracy by dynamically identifying outliers for every request but pay a steep performance penalty for sorting and mixed-precision handling.
- Static/Reordering Methods (e.g., QServe [41], Atom [86]), which have lower runtime overhead but may sacrifice accuracy by relying on less adaptive, coarse-grained heuristics like channel reordering.
Oaken carves out a clever and pragmatic middle ground. The insight that a model's activation distributions have stable, intrinsic properties that can be captured once offline (as shown in Section 4.1, page 6) is powerful. This allows the system to reap the benefits of outlier-aware quantization without bearing the crippling cost of dynamic discovery, effectively finding a sweet spot on the accuracy-performance curve.
-
Holistic System Co-Design: A major strength of this paper is its commitment to a full-stack solution. The authors do not merely propose an algorithm and speculate on its performance benefits. Instead, they meticulously design the hardware necessary to make the algorithm effective. The design of the custom quantization engine, the fused dense-and-sparse encoding scheme (Section 4.5, page 8), and the dual-table MMU for handling the resulting data structures (Section 5.2, page 9) demonstrates a deep understanding of the practical challenges. This co-design makes the impressive performance results far more credible than those in algorithm-only papers, as it directly accounts for implementation overheads.
-
Excellent Contextualization and Problem Framing: The paper does an outstanding job of positioning its work within the broader research landscape. The introductory discussion and, in particular, Figure 1 (page 2), provide a clear and compelling map of the solution space for LLM inference. By plotting existing works on a "bandwidth-capacity trade-off" axis, the authors immediately establish the context and significance of their contribution. This framing helps the reader understand that Oaken is not just another quantization paper, but a systemic attempt to push the Pareto frontier of serving efficiency. The analysis in Section 3, which motivates the work by examining the limitations of HBM vs. LPDDR memory, further grounds the research in tangible, real-world system design constraints.
Weaknesses
While the paper is strong, its central premise rests on an assumption whose boundaries could be explored more deeply.
-
Robustness of Offline Profiling: The entire approach is predicated on "Observation 2" (Section 4.1, page 6)—that the range of KV cache values is consistent across different input datasets. The authors validate this on several standard academic benchmarks. However, in real-world deployment, LLM services encounter a vast and unpredictable range of inputs, including adversarial prompts, out-of-distribution topics (e.g., code generation, non-English languages), and different fine-tuning domains. The paper would be strengthened by a discussion on the sensitivity of the offline-generated thresholds. How gracefully does the system degrade if it encounters inputs that generate activation distributions that deviate significantly from the profiling set? A sensitivity analysis could better define the operational envelope of Oaken.
-
Disentangling Algorithmic vs. Hardware Gains: The co-design is a strength, but it also makes it slightly difficult to isolate the benefit of the Oaken algorithm itself from the custom accelerator it runs on. The baselines are either GPU-based systems (vLLM, QServe) or another accelerator (Tender). While the latency breakdown in Figure 12b (page 12) is useful, a direct throughput comparison of an "Oaken-on-GPU" software kernel against other GPU baselines in Figure 11 would be highly illuminating. This would allow the community to understand how much of Oaken's advantage stems from its more efficient algorithm versus the inherent benefits of a specialized ASIC implementation, providing a clearer picture of its applicability to existing commodity hardware.
Questions to Address In Rebuttal
-
Could the authors comment on the robustness of the offline profiling step? Have they investigated how sensitive the chosen thresholds are to more dramatic domain shifts in input prompts (e.g., from prose to source code, or across different languages)? What would be the recommended procedure for a production system—is one-time profiling on a general-purpose dataset sufficient, or would you recommend re-profiling for specific, fine-tuned models or domains?
-
The hardware-software co-design is a key feature. To help the community better appreciate the algorithmic contribution, would it be possible to provide throughput data for a CUDA-kernel implementation of the Oaken quantization algorithm running on the A100 GPU? Placing this as a baseline in Figure 11 would clarify how much of the performance gain over methods like KVQuant comes from the superior algorithm (avoiding online sorting) versus the custom hardware.
-
The paper proposes Oaken as a set of modules that can be integrated with "any LLM accelerators." Could the authors briefly elaborate on the practical considerations for such an integration? For example, what are the primary interface requirements between the Oaken DMA unit and a host accelerator's processing cores and memory subsystem?
Review 3
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper presents Oaken, an algorithm-hardware co-design for accelerating batched LLM inference by quantizing the Key-Value (KV) cache. The authors identify the high cost of online outlier detection in existing mixed-precision KV cache quantization schemes as a key performance bottleneck.
The central claim to novelty lies in a hybrid online-offline approach. The authors propose to: 1. Offline Profile for Thresholds: Determine static thresholds that partition KV cache values into three groups (inner, middle, outer) based on pre-computed statistics from sample prompts. 2. Online Scale Calculation: During inference, dynamically calculate the quantization scale (min/max) for each of the three groups online, but only for the newly generated token. 3. Group-Shift Quantization: For the outer and middle groups (outliers), shift the values by the offline-determined thresholds before quantization to narrow their dynamic range. 4. Fused Dense-and-Sparse Encoding: Propose a hardware-aware memory layout where a portion of the sparse outlier data is stored in the zeroed-out entries of the dense inlier matrix to reduce metadata overhead.
These algorithmic choices are coupled with custom hardware units (quantization/dequantization engines and an MMU) designed to accelerate this specific workflow. The authors claim this co-design achieves superior throughput compared to existing GPU-based solutions with minimal accuracy loss.
Strengths
The primary strength of this work is the clever synthesis of existing quantization concepts into a novel, high-performance system designed to solve a well-defined problem. The novelty is not in a single disruptive idea, but in the specific combination and refinement of several techniques.
-
The Online-Offline Hybrid Scheme: The core idea of using offline profiling to determine static group thresholds while calculating quantization scales online is a novel and pragmatic solution to the performance-accuracy trade-off. It correctly identifies that online sorting (e.g., topK in KVQuant [22]) is prohibitively expensive, while a purely static, offline approach may lack adaptivity. This hybrid method appears to be a new point in the design space.
-
Group-Shift Quantization: The technique of subtracting offline-derived thresholds from outlier values before quantization (Section 4.4, page 7) is a non-trivial algorithmic contribution. It directly addresses the challenge of quantizing wide-range values to a low bitwidth without resorting to higher-precision formats (e.g., FP16), thereby reducing the overhead associated with mixed-precision data handling.
-
Fused Dense-and-Sparse Encoding: This memory layout optimization (Section 4.5, page 8) is a clever, hardware-aware contribution. While storing outliers sparsely is not new (e.g., SqueezeLLM [30]), repurposing the zeroed entries in the dense matrix to store part of the outlier data is a specific and novel technique for reducing the significant metadata overhead of sparse formats.
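As an illustration of the idea being credited here, the sketch below parks the low-order bits of each 5-bit outlier code in the zeroed 4-bit slot it vacated in the dense inlier array; the paper's exact bit layout, index encoding, and sign handling differ, so this should be read only as a rendering of the concept.

```python
import numpy as np

def fuse_encode(dense_int4, outlier_pos, outlier_q5):
    # dense_int4: np.ndarray of inlier codes with zeros at outlier positions.
    # outlier_q5: unsigned 5-bit outlier codes, one per entry of outlier_pos.
    fused = dense_int4.copy()
    side = []
    for pos, q5 in zip(outlier_pos, outlier_q5):
        fused[pos] = q5 & 0xF                   # reuse the vacated 4-bit slot
        side.append((pos, (q5 >> 4) & 0x1))     # only the top bit spills to the sparse side
    return fused, side

def fuse_decode(fused, side):
    codes = fused.astype(np.int16)
    for pos, hi in side:
        codes[pos] = (hi << 4) | (int(fused[pos]) & 0xF)   # reassemble the 5-bit outlier
    return codes
```

Whatever the precise layout, the saving comes from not having to store the full outlier value out of band, only its residual bits alongside a compact position index.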
Weaknesses
From a novelty perspective, the primary weakness is that the constituent components of the proposed solution are variations on well-established themes in quantization and systems design. The work is more of an expert-level integration and optimization than a fundamental conceptual breakthrough.
-
Incremental Algorithmic Concepts:
- The concept of identifying and separately handling outliers is the foundation of most modern high-accuracy quantization schemes (e.g., LLM.int8() [15], SmoothQuant [71], Olive [19]).
- The extension from a binary inlier/outlier split to a ternary inner/middle/outer grouping (Section 4.3, page 6) is an incremental, rather than fundamental, advancement. The motivation to preserve small-magnitude values is well-documented in prior art [2, 13, 27, 34], which the authors cite.
- The "group-shift" technique, while effective, is conceptually analogous to affine quantization, where a zero-point is used to shift the data range. Here, the shift value is cleverly tied to the outlier threshold, but the underlying principle of data shifting pre-quantization is not new.
-
Coupled Hardware Novelty: The hardware modules described in Section 5 (page 8) are themselves direct, albeit efficient, implementations of the proposed algorithm. Their novelty is therefore coupled to, and not independent of, the algorithmic novelty. The idea of building custom hardware for quantization/dequantization is not new; the contribution is that this hardware is purpose-built for the specific Oaken algorithm. As such, the hardware's novelty cannot be evaluated in isolation.
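For the rebuttal discussion, it may help to state the comparison flagged above explicitly; the notation below is this reviewer's, with T_g denoting the offline-derived threshold and s_g the online per-group scale.

```latex
% Affine quantization with zero-point z (b-bit):
q = \operatorname{clamp}\!\big(\lfloor x/s \rceil + z,\ 0,\ 2^{b}-1\big), \qquad \hat{x} = s\,(q - z)

% Group-shift as described in Section 4.4 (per group g):
x' = x - \operatorname{sign}(x)\,T_g, \qquad q = \lfloor x'/s_g \rceil, \qquad \hat{x} = s_g\,q + \operatorname{sign}(q)\,T_g
```

Both mechanisms move the representable range onto the data; the distinction the authors should defend appears to be that the shift here is a fixed, profiled constant shared by an entire group rather than a zero-point derived per tensor online.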
Questions to Address In Rebuttal
-
Delineation from Prior Hybrid Approaches: Could the authors more precisely delineate the novelty of their online-offline hybrid scheme against prior works that use offline calibration data to simplify online computation? For instance, SmoothQuant [71] uses an offline calibration set to determine scaling factors that are applied statically during inference. While the mechanism is different, the principle of using offline analysis to avoid expensive online calculations is shared. Please clarify what makes the "offline thresholds, online scaling" combination fundamentally new.
-
Necessity of the Three-Group Design: The paper justifies the three-group (inner/middle/outer) design to handle both large- and small-magnitude outliers. However, Table 3 (page 12) explores variations with four and five groups but does not include a direct comparison to a simpler two-group (inlier/outlier) scheme that uses the same group-shift and fused encoding techniques. Is the complexity of the third group (and its associated thresholds) essential for the reported performance, or would a two-group system achieve comparable results with less overhead? This is critical to justifying the novel complexity introduced.
-
Generality of Fused Encoding: The fused dense-and-sparse encoding is presented as a key optimization for reducing the bitwidth of an outlier entry from 23 to 8 bits. How sensitive is this technique to the specific bitwidths chosen (4-bit inlier, 5-bit outlier)? Does the benefit and the 8-bit alignment hold if, for example, 3-bit or 2-bit quantization were used for inliers, which is a direction of active research? Please comment on the generality of this novel encoding scheme beyond the specific configuration presented.
LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference
Abstract
Large Language Model (LLM) inference becomes resource-intensive, prompting a shift toward low-bit model weights to reduce the memory footprint and improve efficiency. Such low-bit LLMs necessitate the mixed-precision matrix multiplication (mpGEMM), an ...
Reviews
Review 1
Paper Title: LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference
Reviewer: The Guardian
Summary
This paper proposes "LUT Tensor Core," a software-hardware co-design to accelerate mixed-precision GEMM (mpGEMM) for low-bit LLM inference using a lookup table (LUT) based approach. The authors identify that existing LUT-based methods suffer from high overheads and propose a series of optimizations. On the software side, they introduce operator fusion for table precomputation, weight reinterpretation for table symmetrization, and table quantization. On the hardware side, they propose a bit-serial LUT unit with an "elongated" tiling shape to maximize table reuse. These components are integrated via a new set of "LMMA" instructions and a compilation stack.
While the paper addresses a relevant problem and presents several plausible optimizations, the evaluation methodology is beset by fundamental flaws and questionable assumptions. The headline claims of outperforming state-of-the-art commercial GPUs rely on an invalid cross-technology comparison, an unverified custom simulator, and modifications to the baseline GPU architecture. Consequently, the reported performance gains are not credible.
Strengths
- Problem Formulation: The work correctly identifies a critical and underexplored problem: the lack of native hardware support for mpGEMM and the inefficiencies of both dequantization-based and conventional LUT-based software approaches.
- Table Symmetrization Technique: The weight reinterpretation technique described in Section 3.1.2 (Page 5) to halve the LUT size by exploiting symmetry is a clever and sound optimization. It is a clear, self-contained contribution.
- Well-Controlled Ablation Study: The comparison against a re-implemented UNPU baseline in Table 2 (Page 11) appears to be a well-controlled experiment. This analysis, which shows a 1.44x improvement in compute intensity and power efficiency, is the most believable result in the paper as it compares two designs under the same constraints.
Weaknesses
The paper's conclusions are built upon a foundation of severe methodological weaknesses, which I will detail below.
-
Fundamentally Invalid Cross-Technology Comparison: The central claims of the paper, summarized in Table 1 (Page 11), are derived by comparing the authors' proposed design, simulated on a 28nm process, against NVIDIA's A100 and H100 GPUs, which are built on 7nm and 4nm processes, respectively. The footnote (†) acknowledges this and claims the data is "normalized," but this normalization is insufficient and scientifically unsound. Moore's Law and Dennard scaling provide monumental, non-linear gains in performance, power, and area (PPA) that cannot be papered over by simple frequency scaling. Comparing a 28nm design to a 7nm one is not an apples-to-apples comparison; it is an apples-to-spaceships comparison. The resulting claims of 20.9x compute density improvement are therefore meaningless.
Reliance on a Custom, Unvalidated Simulator for End-to-End Results: The authors state in Section 4.4 (Page 8) that they developed a custom "tile-based simulator" for end-to-end evaluation because a validated simulator like Accel-Sim is too slow. This is a critical flaw. The performance of a complex system like a GPU depends heavily on the intricate interplay of the memory hierarchy, interconnects, and resource contention—factors that are notoriously difficult to model accurately. The authors provide a cursory validation in Figure 16 (Page 9) showing a ~5% error on a single layer, but this is insufficient to establish trust in a new, custom simulation tool for evaluating entire models. All the major end-to-end speedup claims in Figure 17 and Table 1 are based on this unverified tool, rendering them suspect.
-
Unjustified Architectural Modifications in Kernel-Level Simulation: In the kernel-level evaluation (Section 4.3, Page 8), the authors use Accel-Sim but introduce a "register capacity adjustment." They admit that "insufficient registers... restrict large tiling" and their reported performance gains are contingent on this modification. This means they are not evaluating their design on a stock A100 architecture, but on a hypothetical one with a larger register file that is conveniently tailored to their method's needs. This is a significant caveat that is not sufficiently highlighted and invalidates the direct comparison to the real A100's performance.
-
Incomplete Analysis of Proposed Hardware: The paper champions an "elongated" M2N64K4 tiling shape as optimal (Section 3.2.2, Page 6), arguing it maximizes table reuse. However, this argument is one-sided. Such a tiling shape has significant architectural implications for data movement. It requires broadcasting activation data across a wide array of 64 units, potentially creating a wiring and power bottleneck. Furthermore, the trade-offs in terms of register file port pressure and control logic complexity are completely ignored. Without a thorough analysis of these costs, the optimality of this tiling shape is an unsubstantiated claim.
-
Inconsistent and Overstated Claims: The abstract and introduction promise to unlock the "full potential" of LUT-based approaches, which are shown to be highly inefficient in the authors' own baseline tests (Figure 4, Page 4). The baseline LUT-GEMM even suffers from a "Seg. Error," which calls into question the quality and stability of the baseline software implementation they are improving upon. Furthermore, the claim that precomputation overhead is reduced to "almost zero" (Section 3.1.1, Page 5) is an exaggeration contradicted by their own results in Table 4 (Page 12), which show a remaining overhead of 2.5-2.6%. While small, it is not zero.
Questions to Address In Rebuttal
- Please provide a rigorous justification for comparing a 28nm simulated design against 7nm/4nm fabricated hardware. How does your "normalization" method account for the non-linear scaling of transistor density, wire capacitance, and leakage power between these process nodes?
- Regarding the kernel-level evaluation (Section 4.3), what exactly is the "register capacity adjustment"? Please quantify the increase in register file size per SM required to achieve your reported results and provide a PPA analysis of this modification. How would your performance compare to the baseline on a stock A100 without this change?
- Can you provide a more detailed architectural validation of your custom end-to-end simulator? Specifically, how does it model the cache hierarchy (L1/L2), memory bandwidth contention from multiple SMs, and the on-chip network-on-chip (NoC)?
- Please elaborate on the "Seg. Error" observed for the LUT-GEMM baseline in Figure 4. Is this a known bug in the work you are citing [53], or is it an artifact of your implementation? How can readers be confident in your speedup claims if the baseline is unstable?
- Provide a quantitative analysis of the trade-offs of your proposed M2N64K4 tiling shape. Specifically, what is the area and power overhead of the broadcast network required for activations compared to a more square tile like 16x16x2?
Review 2
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents LUT Tensor Core, a software-hardware co-design to accelerate mixed-precision GEMM (mpGEMM) operations, which are central to low-bit LLM inference. The authors identify that current hardware is ill-suited for mpGEMM (e.g., INT4 weights x FP16 activations), forcing inefficient dequantization-based workarounds. The proposed solution revives the lookup table (LUT) approach, but systematically addresses its traditional bottlenecks—namely table precomputation overhead and storage costs.
The core contribution is a holistic, full-stack solution. On the software side, they use compiler techniques like operator fusion to eliminate precomputation overhead and leverage weight reinterpretation to halve table storage. On the hardware side, they propose a simplified, bit-serial Tensor Core with an elongated tiling shape to maximize table reuse and flexibility. This is exposed to programmers via a new LMMA instruction set and integrated into a TVM-based compilation stack. The evaluation shows significant improvements in compute density and energy efficiency over conventional MAC-based designs and a 1.44x improvement over the state-of-the-art LUT-based accelerator, UNPU.
Strengths
-
A Genuine and Elegant Co-Design: The paper's primary strength is that it truly embodies the principle of software-hardware co-design. Rather than designing hardware in a vacuum, the authors intelligently partition the problem. The "hardware-unfriendly" tasks of table generation and optimization are offloaded to software and the compiler stack (Section 3.1, page 5), which dramatically simplifies the resulting hardware logic. This synergy is the key to their impressive PPA results. The weight reinterpretation to exploit symmetry (Figure 7) is a particularly clever example of this principle.
-
Addressing a Critical and Timely Problem: The work is situated at the heart of a major challenge in deploying AI: the prohibitive cost of LLM inference. As the community aggressively pushes toward sub-4-bit weight quantization (e.g., BitNet), the mismatch with existing hardware becomes a more severe bottleneck. This paper doesn't just identify the problem; it provides a concrete and well-reasoned architectural proposal to solve it. It offers a compelling alternative to the approach taken by major vendors like NVIDIA, which involves adding an ever-expanding set of dedicated narrow-precision MAC units.
-
Excellent Contextualization and Motivation: The authors do a superb job of motivating their work. Section 2.3 ("Gaps in Current LUT-based Solutions," page 4) clearly articulates why a naive LUT approach fails, setting the stage perfectly for their contributions. The paper connects to a rich history of architectural ideas—including bit-serial processing (like Stripes), tile-based DNN compilation (TVM, CUTLASS), and domain-specific accelerators—and synthesizes them into a novel solution for a modern problem.
-
Strong Empirical Analysis: The multi-level evaluation, from dot-product unit microbenchmarks (Section 4.2.1) to kernel-level simulations (Section 4.3) and detailed comparisons against prior work (Section 4.5), is comprehensive. The head-to-head comparison with UNPU (Table 2, page 11) is particularly valuable, as it grounds their claims against a known SOTA design. The ablation study in Table 2, showing the incremental benefit of each optimization, is excellent and clearly demonstrates the value of their co-design approach.
Weaknesses
-
Reliance on a High-Level Simulator for End-to-End Results: While understandable due to the infeasibility of using Accel-Sim for full models, the reliance on a custom, tile-based simulator for the end-to-end results (Section 4.4, page 8) introduces a degree of uncertainty. Although validated against real hardware (Figure 16), such simulators can miss subtle but important second-order effects related to memory contention, pipeline stalls, or control overhead. The impressive end-to-end speedups should be interpreted with this context in mind.
-
Limited Comparison with Emerging Native Hardware Support: The paper acknowledges the trend of native mpGEMM support in emerging architectures like NVIDIA's Blackwell (Section 6, page 12). However, the discussion is largely qualitative. A more in-depth analysis, even if theoretical, comparing the trade-offs of the flexible LUT-based approach versus a hardware architecture with dedicated FP4/FP6 MAC units would be highly insightful. For example, how does the area and power of the proposed LUT Tensor Core compare to an iso-throughput array of native MXFP4 MACs? This would help position the work more firmly within the future architectural landscape.
-
On-Chip Network Implications of Table Broadcasting: The software optimizations rely on a "precompute-once, broadcast-many" model. While this amortizes the computation, it introduces on-chip communication traffic as the generated LUTs must be distributed from a central compute resource (like vector units) to all the LUT Tensor Cores that need them. In a large-scale system with many such cores, this broadcast traffic could become a performance or power bottleneck. This aspect is not fully explored in the paper.
Questions to Address In Rebuttal
-
Could the authors elaborate on the potential sources of inaccuracy in their end-to-end simulator? Specifically, what architectural effects (e.g., interconnect contention, cache coherence for shared LUTs) are abstracted away, and how might these impact the reported performance gains on a real-world system-on-chip?
-
Could you provide a more direct comparative discussion on your LUT-based approach versus the path of adding dedicated narrow-precision format units, as seen in recent commercial GPUs? What are the fundamental trade-offs in terms of architectural flexibility (e.g., supporting non-standard formats like ternary weights), area efficiency, and design complexity?
-
Regarding the precomputation dataflow, have you analyzed the memory bandwidth and on-chip network traffic required to broadcast the precomputed tables to the distributed LUT Tensor Cores? At what scale (i.e., number of cores) might this communication become a limiting factor for performance or energy?
Review 3
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents LUT Tensor Core, a software-hardware co-design for accelerating low-bit Large Language Model (LLM) inference. The central problem addressed is the inefficiency of mixed-precision matrix multiplication (mpGEMM), where low-precision weights are multiplied by higher-precision activations. The authors propose a lookup table (LUT)-based approach that they claim overcomes the limitations of prior software and hardware LUT designs.
The core of their claimed contribution is a three-part co-design:
1. Software Optimizations: Employing operator fusion to absorb the LUT precomputation overhead, and using weight reinterpretation to exploit numerical symmetry, thereby halving the LUT storage requirements.
2. Hardware Architecture: A simplified, bit-serial LUT-based Tensor Core design that leverages an "elongated" tiling shape (high N, low M, low K) to maximize table reuse.
3. Compiler/ISA Support: A new LMMA instruction to expose the hardware to a tile-based deep learning compiler.
Strengths
The primary strength of this work lies not in the invention of a single new component, but in the holistic synthesis and intelligent partitioning of responsibilities between software and hardware. The authors correctly identify that naive hardware-centric LUT designs suffer from excessive overhead. By moving complex or redundant tasks—such as table precomputation and exploiting weight properties—into the software and compilation stack, they achieve a significant simplification and efficiency gain in the hardware itself.
The most notable element is the weight reinterpretation for table symmetrization detailed in Section 3.1.2 (page 5). While exploiting symmetry is a classic optimization principle, its specific application here to map unsigned integer weights to a symmetric signed representation to halve the LUT size and associated broadcast/multiplexer hardware is a clever and effective method. This re-partitioning demonstrates a clear co-design benefit: a software-level transformation directly enables a more compact and efficient hardware implementation.
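To make the symmetry argument concrete, the sketch below builds the precomputed table for one K-element activation group and serves lookups for sign-pattern weights; storing only the half-table with a fixed leading sign is exactly the halving attributed to the reinterpretation, and multi-bit weights are handled bit-serially under an assumed binary-coded (±1 per plane) decomposition. The names and the K=4 group size are illustrative, not taken from the paper.

```python
import numpy as np
from itertools import product

def build_half_lut(acts):
    # All dot products s . acts for sign patterns s in {+1,-1}^K with the leading
    # sign fixed to +1: 2**(K-1) entries instead of 2**K, because T(-s) = -T(s).
    table = {}
    for rest in product((1, -1), repeat=len(acts) - 1):
        signs = (1,) + rest
        table[signs] = float(np.dot(signs, acts))
    return table

def lut_dot(half_table, weight_signs):
    key = tuple(weight_signs)
    if key in half_table:
        return half_table[key]
    mirror = tuple(-s for s in key)
    return -half_table[mirror]                  # symmetry supplies the missing half

def lut_dot_bitserial(half_table, planes, alphas):
    # Multi-bit weights, assuming a binary-coded (+/-1 per plane) decomposition:
    # each plane is a sign pattern looked up in the same table, then scaled and summed.
    return sum(a * lut_dot(half_table, p) for p, a in zip(planes, alphas))

acts = np.array([0.7, -1.2, 3.0, 0.1], dtype=np.float32)   # one K=4 activation group
lut = build_half_lut(acts)                                  # 8 entries instead of 16
print(lut_dot(lut, (-1, 1, -1, 1)))                         # -0.7 - 1.2 - 3.0 + 0.1 = -4.8
```

The same table is reused for every output column that shares the activation group, which is also why the elongated, large-N tile shape pays off: the one-time precompute is amortized over many more lookups.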
Weaknesses
My primary concern is that the paper's claims of novelty are overstated, as the work is fundamentally an integration of several well-established techniques from different domains. The contribution appears to be more of a strong engineering effort in system integration rather than the introduction of fundamentally new concepts.
-
Constituent Techniques are Not Novel: The paper builds its "novel co-design" on a foundation of existing ideas:
- Bit-Serial Processing: The use of a bit-serial datapath to handle flexible precisions (Section 3.2.1, page 6) is a well-known technique for area and energy efficiency in DNN accelerators. This dates back to work like Stripes (Judd et al., MICRO 2016), which the authors cite as [27]. Applying this to a LUT-based unit is a logical extension, not a conceptual breakthrough.
- Operator Fusion: The fusion of the LUT precomputation kernel with preceding element-wise operations (Section 3.1.1, page 5) is a standard compiler optimization implemented in virtually all modern deep learning frameworks (including TVM, which they use). Its application here is expected, not novel.
- LUT-based Computation: The core idea of using LUTs to accelerate DNNs, especially with low-bit weights, is not new. The authors themselves cite prior art such as UNPU [38] and Biqgemm [26].
- Design Space Exploration for Tiling: The discovery of an "elongated" tiling shape (Section 3.2.2, page 6) is the result of a design space exploration (DSE). DSE is a standard methodology in hardware architecture. While the resulting insight is valuable for this specific design, the method is not novel, and optimized tiling shapes are a cornerstone of high-performance libraries like CUTLASS.
-
The "Delta" Over Prior Art is Incremental: The novelty rests almost entirely on the synthesis of these known techniques. The authors compare against UNPU [38] in Table 2 (page 11), showing a 1.44x improvement. This improvement is achieved by combining their optimizations. However, it is not clear that this represents a fundamental conceptual leap. For instance, the weight reinterpretation trick is clever, but it is an algebraic simplification that could, in principle, be applied to other LUT-based designs. The paper does not convincingly argue why this specific combination of known methods was non-obvious or paradigm-shifting. The work appears to be a very competent optimization of the LUT-based accelerator paradigm, rather than a reinvention of it.
Questions to Address In Rebuttal
-
The central claim is a "software-hardware co-design." However, the constituent parts (bit-serial processing, operator fusion, LUTs for DNNs, DSE for tiling) are all established techniques. Please articulate precisely what is the fundamental novel concept in your co-design, beyond the successful application and integration of these known methods to the mpGEMM problem.
-
The weight reinterpretation to exploit symmetry (Section 3.1.2) is the most compelling part of the software optimization. Was this technique previously proposed in the context of LUT-based accelerators for general mpGEMM (beyond simple binary/ternary networks)? Please clarify the delta between your method and standard techniques for handling signed/unsigned numbers in other computational paradigms.
-
The paper identifies an elongated tile shape (e.g., M2N64K4) as optimal. How general is this finding? Is this shape universally optimal for LUT-based mpGEMM, or is it an artifact of the specific activation (e.g., FP16/INT8) and weight (e.g., INT1/2/4) bit-width ratios evaluated? How does the optimal M:N:K ratio change as the activation bit-width approaches the weight bit-width?
-
The comparison with the "Conventional LUT" implementation in Figure 13 (page 7) suggests it has poor area scaling. What specific design assumptions were made for this "Conventional LUT" baseline? Does it also use bit-serial processing or is it a fully parallelized design, which would explain its large area? Clarifying this is crucial to fairly assess the novelty and benefit of your proposed hardware design.
AiF: Accelerating On-Device LLM Inference Using In-Flash Processing
Abstract
While large language models (LLMs) achieve remarkable performance across diverse application domains, their substantial memory demands present challenges, especially on personal devices with limited DRAM capacity. Recent LLM inference engines have ...
Reviews
Review 1
Paper Title: AiF: Accelerating On-Device LLM Inference Using In-Flash Processing
Reviewer: The Guardian
Summary
This paper proposes "Accelerator-in-Flash" (AiF), an in-flash processing (IFP) architecture designed to accelerate on-device Large Language Model (LLM) inference. The core contribution is a pair of flash-level optimizations: 1) "charge-recycling read" (cr-read), a technique to reduce latency for sequential wordline accesses, and 2) "bias-error encoding" (be-enc), a Vth state reconfiguration scheme to improve the reliability of pages storing LLM parameters. The authors claim that these techniques enable a 4x increase in internal flash bandwidth, which, when integrated into a full system, results in a 14.6x throughput improvement over baseline SSD offloading and even a 1.4x improvement over a high-end, in-memory (DRAM) system. The evaluation is conducted using a modified version of the NVMeVirt SSD emulator integrated with the llama.cpp inference engine.
Strengths
-
Well-Motivated Problem: The paper correctly identifies the memory capacity and bandwidth limitations as critical barriers to on-device LLM deployment. The analysis in Section 3, particularly Figure 4, effectively establishes the performance bottlenecks and reliability requirements that motivate the work.
-
Conceptually Sound Primitives: The two core technical ideas are well-conceived. The concept of cr-read (Section 4.2) leverages the known spatial locality of model parameter access to optimize the flash read sequence. Similarly, be-enc (Section 4.3) is a clever application of reconfiguring Vth states to create a heterogeneous reliability profile within a single TLC block, prioritizing the LLM data.
-
Holistic Approach: The work commendably attempts to address both performance (bandwidth via cr-read) and reliability (error rates via be-enc) simultaneously. This is a crucial and often overlooked aspect of practical IFP system design.
Weaknesses
My primary concerns with this work lie in the significant gap between the proposed concepts and the evidence provided to support their practicality, scalability, and system-level viability. The evaluation relies on a chain of optimistic assumptions and limited validations that undermine the paper's extraordinary claims.
-
Validation of cr-read is Fundamentally Insufficient: The claim that cr-read is functionally sound and yields a 2.8x bandwidth improvement (Section 4.2.2, page 7) rests on SPICE simulations and experiments on a "fabricated CTF cell array" of 9x9 WLs/BLs (footnote 5, page 6). A 9x9 array is a laboratory-scale toy, not a proxy for a modern 3D NAND device with thousands of wordlines, complex peripheral circuitry, and significant parasitic capacitances. Real-world phenomena like read disturb, wordline-to-wordline coupling, and thermal effects, which are critical at scale, are not captured. Presenting this as sufficient validation for a production-level technique is a significant overstatement.
-
System-Level Side Effects of be-enc are Ignored: The be-enc scheme (Section 4.3) introduces performance and reliability asymmetry, degrading the non-LSB pages used for "general data." The paper dismisses this impact with a cursory analysis in Figure 18 (page 12), showing only a minor drop in random read IOPS. This analysis is critically incomplete. It fails to address:
- Garbage Collection (GC) and Write Amplification: How does a flash translation layer (FTL) manage blocks containing both high-endurance LSB pages (for LLM data) and low-endurance CSB/MSB pages (for general data)? This heterogeneity would drastically complicate wear-leveling and GC, likely leading to increased write amplification and premature device wear, none of which is modeled or discussed.
- Write/Erase Performance: The paper only analyzes the read path. Does reconfiguring the Vth states for be-enc impact program/erase times or program disturb characteristics? This is not mentioned.
-
The Reliability Premise of ECCLITE is Fragile: The entire justification for the lightweight ECCLITE decoder hinges on the characterization in Figure 13(b) (page 8), which shows a maximum of 9 bit errors per 1-KiB for be-enc LSB pages. The proposed ECCLITE corrects up to 10 bits. This leaves a safety margin of a single bit error. In real-world flash deployment, error rates are a distribution, not a fixed maximum. This razor-thin margin provides virtually no resilience against process variation, variable retention times, or higher-than-expected cell degradation near the device's end-of-life. A robust system would require a much larger ECC margin (a rough tail-probability sketch follows this list).
-
The Evaluation Model for Concurrency is Overly Optimistic: The claim of surpassing in-memory performance (1.4x) is predicated on the parallel host-AiFSSD execution model (Figure 15c, page 9). The NVMeVirt-based evaluation (Section 6.1) simulates the delay of the AiFSSD computation. However, it is highly unlikely that this model accurately captures the true overhead of the proposed fine-grained, tightly-coupled interaction. Each GEMV offload requires command submission, potential context switching, interrupt handling, and DMA setup/teardown. In a real system, this control-plane overhead for frequent, small operations could easily dominate the data-plane latency, nullifying the benefits of parallelism. The simulation appears to model an idealized best-case scenario.
-
Contradictory Claims on Bottlenecks: The authors motivate the work by claiming memory bandwidth is the primary bottleneck. However, their own scalability analysis (Figure 17b, page 12) shows sub-linear performance scaling when doubling or quadrupling internal bandwidth. They attribute this to "multiple vector arithmetic operations" and "NVMe protocol overheads." This is a crucial admission that their solution only addresses a part of the problem, and that other components (left on the host) become the new bottleneck. This weakens the central thesis that simply maximizing internal flash bandwidth is the definitive solution.
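Returning to the ECCLITE margin point above: a minimal back-of-the-envelope sketch, assuming an i.i.d. binomial error model and purely illustrative raw bit error rates (RBER) around the characterized worst case, shows how quickly a one-bit margin erodes. These are not the paper's measured error distributions.

```python
from math import comb

def p_uncorrectable(n_bits: int, rber: float, t_correct: int) -> float:
    """Probability that more than t_correct of n_bits i.i.d. bits flip."""
    p_le_t = sum(comb(n_bits, k) * rber**k * (1 - rber) ** (n_bits - k)
                 for k in range(t_correct + 1))
    return 1.0 - p_le_t

# A 1-KiB payload is ~8192 bits; ECCLITE is described as correcting up to 10 bits.
# The characterized worst case (9 errors per 1-KiB) corresponds to an RBER near 1.1e-3.
for rber in (5e-4, 1.0e-3, 1.5e-3, 2.0e-3):
    print(f"RBER={rber:.1e}  P(>10 errors per 1-KiB page) = "
          f"{p_uncorrectable(8192, rber, 10):.2e}")
```

Even under this simplistic model, a modest increase in RBER beyond the characterized point moves the per-page failure probability from rare to routine, which is the substance of the concern.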
Questions to Address In Rebuttal
The authors must provide convincing evidence and clarification on the following critical points:
-
Regarding cr-read Validation: How can the results from an 81-cell array be extrapolated to a commercial multi-Gb flash die? Please provide analysis or data on how cr-read would behave in the presence of scaled-up parasitics, read disturb, and other real-world, dynamic flash array effects.
-
Regarding be-enc System Impact: Provide a detailed analysis of how be-enc would interact with a modern FTL. Specifically, quantify its impact on garbage collection efficiency, write amplification, and the endurance of blocks that mix IFP and general-purpose data.
-
Regarding ECCLITE Reliability: Please justify the decision to use an ECC scheme with only a 1-bit error correction margin over your characterized maximum. What is the expected failure rate of this scheme at the device's certified P/E cycles and retention period, considering statistical variations in error rates?
-
Regarding the Concurrency Model: Can you provide evidence that your NVMeVirt timing model accurately accounts for the full software stack and hardware overhead (interrupts, DMA, protocol overhead) of the frequent host-SSD synchronization required by your parallel execution scheme? Please break down the latency components of a single offloaded GEMV operation in your model versus a real system.
-
Regarding Performance Scaling: Given the admitted sub-linear scaling, what is the theoretical peak throughput of the AiF system as internal bandwidth approaches infinity? At what model size do the non-GEMV operations left on the host become the primary bottleneck, rendering further flash acceleration ineffective?
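A simple Amdahl-style bound frames question 5; the GEMV fraction below is an assumed, illustrative parameter rather than a number reported in the paper.

```python
def bandwidth_scaling_bound(gemv_fraction: float, bw_scale: float) -> float:
    """Amdahl-style bound: only the flash-bandwidth-bound GEMV fraction of
    per-token time scales with internal bandwidth; host-side vector ops and
    NVMe protocol overheads do not."""
    return 1.0 / ((1.0 - gemv_fraction) + gemv_fraction / bw_scale)

# If, say, 90% of per-token time were bandwidth-bound GEMV, infinite internal
# bandwidth would cap the end-to-end speedup at 10x, and 4x bandwidth would
# yield only ~3.1x, consistent in spirit with the sub-linear scaling noted above.
for scale in (2, 4, float("inf")):
    print(scale, round(bandwidth_scaling_bound(0.90, scale), 2))
```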
Review 2
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper addresses the critical challenge of running large language models (LLMs) on personal devices with limited DRAM. The authors identify the primary bottleneck of existing SSD offloading techniques: the low external read bandwidth of flash storage, which severely limits token generation rates for memory-bound LLM inference.
The core contribution is Accelerator-in-Flash (AiF), an in-flash processing (IFP) architecture that moves the dominant matrix-vector multiplication (GEMV) operations directly into the NAND flash chips. This approach is designed to leverage the immense aggregate internal bandwidth of the chips, bypassing the external channel and PCIe bottlenecks. The novelty of this work lies not just in proposing IFP for AI, but in developing two co-designed, flash-aware techniques that make it practical for the unique demands of LLMs:
1. Charge-Recycling Read (cr-read): A novel read command that speeds up sequential reads within a block by eliminating redundant precharge and discharge steps, boosting effective read bandwidth.
2. Bias-Error Encoding (be-enc): A clever VTH state encoding scheme that makes one page type (LSB) exceptionally fast and reliable at the cost of others, allowing LLM parameters to be stored with high fidelity while enabling a highly compact on-chip error correction engine (ECCLITE).
Through a comprehensive evaluation using a full-system simulator integrated with the llama.cpp engine, the authors demonstrate that AiF can provide a 14.6x throughput improvement over conventional SSD offloading and even surpass a high-end in-memory system by 1.4x, all while significantly reducing the host memory footprint.
Strengths
-
High Potential Impact and Significance: This work addresses a timely and significant problem. The inability to run large, capable LLMs locally on consumer devices is a major barrier to privacy, low latency, and offline accessibility. AiF presents a compelling hardware-based solution that could fundamentally alter the landscape of edge AI. Rather than an incremental improvement, this work proposes a system that could enable models (e.g., 40B+ parameters) that are currently infeasible on personal devices to run at interactive speeds. This is exactly the kind of ambitious, problem-driven research the community needs.
-
Excellent Cross-Layer Co-Design: The true strength of this paper lies in its vertical integration of insights from the application layer down to the device physics layer. The authors recognize that LLM inference is (a) dominated by GEMV on static weights and (b) highly sensitive to bit errors (as shown in Figure 4b, page 4). Instead of a generic accelerator, they have designed cr-read and be-enc by fundamentally rethinking the NAND flash read protocol itself to meet these specific requirements. This deep understanding, connecting algorithmic needs to the manipulation of VTH states (Section 4.3, page 7) and read sequence timing (Section 4.2, page 5), is the hallmark of outstanding systems research.
-
Holistic and Plausible System Integration: The authors go beyond the core accelerator idea and consider the full system stack. They outline the necessary extensions to the NVMe protocol (aif_post, aif_gemv), discuss the system software and application-level requirements (Section 5.2, page 9), and even architect a parallel execution model to overlap host and AiFSSD computation (Figure 15c, page 9). The evaluation, which uses a modified full-system SSD emulator (NVMeVirt) and a real inference engine (llama.cpp), lends significant credibility to their performance claims. This end-to-end thinking makes the proposed system feel less like a theoretical concept and more like a blueprint for a real-world product.
Weaknesses
While the core ideas are strong, the paper could benefit from addressing the following points, which are less flaws in the work itself and more contextual limitations.
-
Practical Path to Adoption: The proposed changes are significant, requiring modification to both the internal logic of NAND flash chips and the firmware of the SSD controller. This represents a substantial re-architecting effort for semiconductor manufacturers who are typically conservative. While the paper argues for the feasibility of the required changes (e.g., modifying timer codes, leveraging existing dynamic VTH configuration), the business and engineering inertia to implement such a specialized feature is immense. The work would be stronger if it acknowledged this difficult path to commercialization more directly.
-
Handling Dynamic Data and Competing Workloads: The proposed system is exquisitely optimized for the "write-once, read-many" nature of LLM parameters stored in dedicated "IFP blocks." However, modern SSDs are general-purpose devices. The paper mentions garbage collection (GC) in a footnote (footnote 9, page 10) but does not fully explore the performance implications. If a user is performing heavy I/O operations concurrently with LLM inference, how would background SSD tasks like GC and wear-leveling, which might need to move IFP blocks, impact inference latency and consistency? This interaction between the specialized "compute" workload and general storage tasks is a critical aspect of any practical in-storage processing system.
-
Limited Operational Scope: The work rightly focuses on GEMV as the primary bottleneck. However, other operations in an LLM (e.g., Softmax, LayerNorm) still run on the host, requiring data to move back and forth. The parallel execution scheme (Section 5.1) mitigates this, but as model architectures evolve, the Amdahl's Law effect from the non-accelerated portions could become more significant. This is not a weakness of the current work but a natural question about its extensibility and future-proofing.
Questions to Address In Rebuttal
-
Regarding Practical Implementation: Can the authors elaborate on the non-recurring engineering (NRE) cost and complexity for a NAND manufacturer to implement cr-read and be-enc? Could these changes be accomplished primarily through microcode/firmware patches to the on-chip scheduler and voltage generators, or do they require significant redesign of the analog peripheral circuitry?
-
Regarding System Interference: Could you elaborate on the performance impact of background garbage collection on a running LLM inference task? If an IFP block becomes the target of a GC operation, would this introduce significant, unpredictable latency (i.e., jitter) into the token generation pipeline? How does your proposed system manage or isolate the compute-dedicated blocks from the effects of general storage I/O?
-
Regarding Model Management: The aif_post command is used to initially place the model into the optimized layout. What is the performance overhead of this setup process for a large model (e.g., a 70B parameter model)? How does the system envision handling model updates or switching between different fine-tuned versions of a model, which are common user scenarios?
Review 3
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper proposes Accelerator-in-Flash (AiF), an in-flash processing (IFP) architecture designed to accelerate on-device large language model (LLM) inference. The authors correctly identify that existing SSD offloading techniques are bottlenecked by external I/O bandwidth and that prior IFP solutions fail to meet the unique high-bandwidth and high-reliability demands of LLMs. The core of the proposed contribution is not the general idea of IFP, but rather two specific, low-level flash read techniques designed to make IFP practical for this workload. The first, "charge-recycling read" (cr-read), modifies the NAND read sequence to skip precharge and discharge steps for sequential reads within a block, boosting read speed. The second, "bias-error encoding" (be-enc), reconfigures the threshold voltage (VTH) state mapping in TLC NAND to create an ultra-reliable and fast LSB page type, where LLM parameters are exclusively stored. This increased reliability allows for a lightweight on-chip ECC decoder (ECCLITE), mitigating the area and power overhead that would otherwise make IFP infeasible. The authors evaluate this system via simulation, demonstrating significant throughput gains over both baseline SSD offloading and in-memory inference.
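To make the mechanism under review concrete, a toy timing model of the precharge/discharge amortization is sketched below. The phase times and run length are illustrative assumptions, not the paper's measured parameters; the point is only to show why skipping the reset phases on sequential wordlines raises effective read bandwidth.

```python
def cr_read_speedup(t_precharge: float, t_sense: float, t_discharge: float,
                    run_length: int) -> float:
    """Ratio of conventional read time to a charge-recycling-style read that
    pays precharge and discharge once per run of sequentially read wordlines."""
    conventional = run_length * (t_precharge + t_sense + t_discharge)
    recycled = t_precharge + run_length * t_sense + t_discharge
    return conventional / recycled

# Hypothetical per-wordline phase times (arbitrary units) over a 32-wordline run:
print(round(cr_read_speedup(20, 25, 15, 32), 2))  # ~2.3x under these assumptions
```

Under this simplified view, the reported 2.8x gain would imply the sense phase is roughly a third of a conventional read, though the real timing breakdown is device-specific.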
Strengths
The primary strength of this work lies in its specific, non-obvious proposals for modifying the physical operation of NAND flash to serve a high-level application need. The novelty is not in the high-level concept but in the enabling mechanisms:
-
Novel Co-design: The paper presents a compelling cross-layer co-design. Instead of simply positing an accelerator inside a flash chip, the authors have identified the fundamental physical limitations (read latency, error rates, ECC cost) and proposed concrete solutions (cr-read, be-enc) at the device level. The tight coupling between be-enc reducing error rates and ECCLITE reducing hardware cost is particularly novel and well-conceived.
-
Specific Low-Level Contributions: Both cr-read and be-enc appear to be novel in their specific formulation and application. While the principle of optimizing sequential access or leveraging the differing reliability of pages in MLC/TLC flash may have conceptual precedents, the proposed mechanisms are distinct:
- cr-read (Section 4.2, page 5) as a specific modification to the read state machine (bypassing precharge/discharge) is a clever circuit-level optimization.
- be-enc (Section 4.3, page 7) goes beyond passively using the LSB page; it proposes actively reconfiguring the VTH state encoding (from (2,3,2) to (1,3,3) coding) to intentionally create a privileged, high-performance page type specifically for IFP data. This is a significant conceptual step beyond prior work that merely partitions data based on existing page characteristics.
-
Problem-Driven Innovation: The work is well-motivated. The authors clearly establish in Section 3.3 (page 4) why existing IFP is insufficient for LLMs, pointing to the dual challenges of raw bandwidth and the prohibitive cost of robust on-chip ECC. Their proposed solutions directly and elegantly address these two specific, well-articulated problems.
Weaknesses
From a novelty perspective, the primary weakness is that the paper frames its contribution under the broad umbrella of "In-Flash Processing," a concept with extensive prior art. The true novelty is more nuanced and lies deep within the flash controller and cell programming logic, which could be emphasized more clearly.
-
Incremental Nature of cr-read: The core idea of cr-read—reusing charged line voltages to accelerate subsequent accesses to a physically local region—is a common optimization principle in hardware design. While its application to the NAND read sequence is new in this context, it feels like an incremental, albeit clever, engineering optimization rather than a fundamentally new concept. The paper would be strengthened by a more thorough comparison to any existing "fast sequential read" modes or similar optimizations that may exist in NAND flash manufacturer datasheets or patents, which often contain proprietary, non-public access modes.
-
Overstated Novelty of the General Approach: The paper's narrative implies that IFP for ML is a new direction. However, as the authors' own related work section (Section 7, page 12) points out, numerous works have explored in-storage and in-flash processing for DNNs and other data-intensive workloads ([38], [45], [78]). The key distinction of AiF is its focus on LLM inference and the specific physical-level techniques to overcome the associated challenges, a distinction that should be made more central to the paper's claims.
Questions to Address In Rebuttal
-
Prior Art for cr-read: Can the authors elaborate on the novelty of the cr-read technique relative to prior art in low-level flash memory operation? Are there functionally similar "burst" or "fast sequential" read modes that have been proposed or implemented by memory manufacturers, even if not for the purpose of general-purpose computation? The concept of skipping reset phases for sequential operations is not new in principle; please clarify what makes this specific application to the NAND read sequence fundamentally novel.
-
Prior Art for be-enc: The concept of pages within a multi-level cell exhibiting different reliability characteristics is well-established. Prior work has proposed leveraging this by, for example, placing critical metadata on more reliable pages. The core novelty of be-enc seems to be the active reconfiguration of VTH levels to further enhance the reliability of LSB pages for IFP workloads. Can you confirm if this specific idea—dynamically changing the VTH encoding scheme on a per-block basis to create a privileged data partition for an IFP accelerator—has been proposed before?
-
Complexity vs. Practicality: The proposed modifications, especially cr-read, require altering the fundamental state machine of a NAND flash chip's read path. This represents a significant deviation from the standard ONFI/Toggle interface and would require deep co-design with a NAND manufacturer. Given the extremely high cost and risk associated with modifying this core IP, is the proposed solution a purely academic exploration, or do the authors see a realistic path to adoption? The justification for novelty must also consider the feasibility of the proposed complexity.
LIA: A Single-GPU LLM Inference Acceleration with Cooperative AMX-Enabled CPU-GPU Computation and CXL Offloading
Abstract
The limited memory capacity of single GPUs constrains large language model (LLM) inference, necessitating cost-prohibitive multi-GPU deployments or frequent performance-limiting CPU-GPU transfers over slow PCIe. In this work, we first benchmark recent ...
Reviews
Review 1
Reviewer: The Guardian
Summary
This paper introduces LIA, a framework for accelerating single-GPU LLM inference by leveraging cooperative computation between a GPU and a modern, AMX-enabled Intel CPU. The central thesis is that recent advancements in CPU matrix multiplication capabilities, specifically Intel's AMX, are significant enough to fundamentally alter the trade-offs in CPU-GPU offloading strategies. The authors propose an offloading algorithm that systematically determines which model sublayers to execute on the CPU versus the GPU to minimize latency or maximize throughput. Additionally, they introduce a CXL-based memory offloading policy to expand capacity for throughput-driven scenarios. The paper presents substantial performance improvements over existing frameworks like FlexGen and a CPU-only baseline (IPEX).
While the paper addresses a timely and important problem, its conclusions rest on a series of questionable methodological choices, most notably the comparison against baselines that are not optimized for the very hardware LIA is designed to exploit. The significant performance claims are therefore a conflation of algorithmic novelty and an underlying hardware advantage that is not fairly accounted for in the baseline comparison.
Strengths
-
Timely Exploration of New Hardware: The paper provides a valuable and, to my knowledge, one of the first comprehensive microbenchmark analyses of Intel's AMX performance for realistic LLM workloads (GEMM and GEMV). The characterization in Section 4 and Figure 5 is an important contribution to the community, quantifying the substantial leap from AVX512 and situating AMX performance relative to several generations of NVIDIA GPUs.
-
Systematic Offloading Policy: The formulation of the compute-offloading decision as an optimization problem (Section 5.1) is methodologically sound. The cost model, while simple, holistically considers key system parameters like compute throughput and memory/PCIe bandwidths to derive an optimal policy, moving beyond the heuristic-based approaches of prior work.
-
Well-Reasoned CXL Integration: The memory-offloading policy for CXL is logical and well-motivated. The authors correctly identify that for parameter transfers to the GPU, CXL bandwidth can be sufficient to hide behind the PCIe bottleneck (Observation-1, Section 6), while simultaneously recognizing that CXL's high latency is detrimental to CPU-bound computation, leading to their policy of keeping the KV cache in DDR.
Weaknesses
-
Fundamentally Unfair Baseline Comparison: The central weakness of this paper is the evaluation against inappropriate baselines. The primary competitor, FlexGen, is designed for CPUs with AVX instruction sets. The authors evaluate it on a Sapphire Rapids CPU but do not appear to have modified it to utilize AMX. Consequently, the reported speedups of up to 5.1x-19x are not solely attributable to LIA's "cooperative framework" but are heavily skewed by comparing an AMX-native framework to an AVX-based one. The paper fails to disentangle the gains from their scheduling algorithm versus the raw hardware advantage. A fair comparison would require an AMX-enabled version of FlexGen or another strong baseline that also leverages AMX. The comparison to IPEX (CPU-only) is a strawman for any model that could even partially fit on a GPU.
-
Over-reliance on Analytical Models and Simulation: A significant portion of the paper's results, particularly for large batch sizes and throughput-oriented scenarios, are derived from an analytical model rather than direct measurement (as indicated by stars in Figure 11, page 10). The authors state this model has an "average error of 12%" (Section 7, page 8), which is a non-trivial margin of error that casts doubt on the precision of the claimed speedups. Furthermore, the multi-GPU comparison in Section 7.8 relies on comparing their system (partially evaluated with their own model) against a simulation of a DGX-A100 system. Conclusions drawn from comparing a model to a simulation are speculative at best.
-
Unsupported Claims of Generalizability: In Section 7.7, the authors claim LIA's optimizations generalize to other model architectures like GPT, Llama2, and Bloom. This claim is substantiated only by results from their analytical model, not empirical evidence. Performance characteristics can vary significantly across model families due to differences in layer normalization, activation functions, and attention mechanisms. Without measured data on real hardware for at least one other major model family, this claim is unsubstantiated and constitutes an overstatement of the work's demonstrated contributions.
-
Ambiguity Regarding Software Stack Maturity: Footnote 4 on page 5 states that "the recently-introduced AMX libraries are less optimized." This is a critical caveat that is buried and not discussed in the main text. It raises serious questions about the replicability and robustness of the foundational microbenchmarks in Figure 5. If the libraries are immature, the reported performance may not be representative of the hardware's true capability, or conversely, the gains over AVX512 could be even larger. This uncertainty undermines the foundation upon which the entire paper is built.
Questions to Address In Rebuttal
-
Please justify the choice of an AVX-based FlexGen as the state-of-the-art baseline on an AMX-enabled processor. Did you make any attempt to build a stronger, AMX-aware baseline to ensure a fair comparison? If not, how can you deconvolve the performance gains of your algorithm from the underlying hardware performance differential between AMX and AVX?
-
The paper relies heavily on an analytical model for many key results. Can you provide a sensitivity analysis for this model? How do the model's predictions change with variations in its core assumptions (e.g., PCIe bandwidth, CPU memory latency)? Why were direct measurements not feasible for the starred configurations in Figure 11?
-
To support the claim of generalizability (Section 7.7), please provide measured, end-to-end performance data for LIA on at least one non-OPT model, such as Llama2-70B, and compare it against FlexGen on the same model.
-
Regarding Footnote 4, please elaborate on the maturity of the AMX software stack used in your evaluation. How might the presented results and the optimal offloading policies change as these libraries mature and AMX performance presumably improves? Does this not risk making your derived "optimal" policies obsolete?
Review 2
Reviewer Persona: The Synthesizer (Contextual Analyst)
Summary
This paper presents LIA, a framework for accelerating Large Language Model (LLM) inference on a single GPU by leveraging cooperative computation with modern, powerful CPUs. The work is built upon a crucial and timely insight: the introduction of specialized matrix multiplication hardware in recent Intel CPUs (Advanced Matrix Extensions, or AMX) has fundamentally altered the performance landscape, making the CPU a viable computational partner rather than just a passive memory host.
The authors first provide a rigorous performance characterization of AMX-enabled processors (Sapphire Rapids and Granite Rapids), demonstrating that their matrix math throughput is competitive with previous-generation GPUs and reaches a significant fraction of that of modern ones. Building on this, LIA introduces a systematic compute-offloading algorithm that partitions LLM sublayers between the CPU and GPU to optimize for end-to-end latency or throughput. Finally, the paper explores the use of CXL memory to cost-effectively expand system memory for large-batch, throughput-driven scenarios, proposing a simple but effective memory tiering policy. The experimental results show substantial improvements in latency, throughput, and energy efficiency over existing CPU-only and CPU-GPU collaborative frameworks.
Strengths
-
Central Thesis is Timely and Insightful: The core contribution of this work lies in identifying and exploiting a major shift in the hardware landscape. For years, the systems community has treated CPUs in ML workloads as slow, general-purpose engines, primarily useful for their large memory capacity. This paper compellingly argues that this assumption is now obsolete due to accelerators like AMX. The microbenchmarks presented in Section 4 (pages 4-5) provide the crucial evidence for this claim and serve as a strong foundation for the entire paper. This is an excellent example of systems research that is deeply connected to hardware evolution.
-
Holistic and Well-Designed System: LIA is more than just a proof-of-concept for AMX. The authors have designed a complete system that addresses the problem holistically. The formulation of the offloading decision as an optimization problem (Section 5.1, page 6) is principled and provides a generalizable framework. The inclusion of system-level optimizations like efficient GPU memory usage and overlapping communication (Section 5.2, page 6) shows attention to practical implementation details. The integration of CXL (Section 6, page 7) is forward-looking and addresses the next major bottleneck: memory capacity for large-batch inference.
-
Connects Disparate Hardware Trends: A key strength of this work is its ability to synthesize two independent but complementary hardware trends: the rise of on-CPU acceleration (AMX) and the emergence of disaggregated, tiered memory (CXL). The paper demonstrates a synergistic relationship where AMX makes the CPU a more valuable compute resource, and CXL provides the memory capacity needed to feed it in high-throughput scenarios. This provides a powerful architectural blueprint for future cost-effective inference servers.
-
Significant and Well-Documented Performance Gains: The results are not merely incremental. Achieving up to 5.1x higher throughput and 19x lower latency compared to the state-of-the-art single-GPU offloading framework (FlexGen) is a very strong result. This clearly validates that rethinking the role of the CPU is not just an academic exercise but a source of major real-world performance improvements.
Weaknesses
While the work is strong, its framing could be broadened to better contextualize its contributions within the wider landscape of heterogeneous computing.
-
Implicit Dependency on High-End CPUs: The paper's success hinges on the availability of top-tier Intel Xeon processors with a high core count. While this is the correct platform to demonstrate the maximum potential, it leaves open the question of where the "break-even" point is. A discussion on the performance sensitivity to CPU core count and AMX capabilities would help readers understand the cost-performance trade-offs more broadly. The current work presents an architecture that pairs a high-end GPU with a high-end CPU, which is still a significant investment.
-
Limited Discussion on Architectural Generality: The paper is naturally focused on the Intel AMX + NVIDIA GPU ecosystem. However, the core idea of a powerful CPU partner is not unique to Intel. AMD is developing similar capabilities, and the ARM world has its own vector and matrix extensions (SVE, SME). The discussion of Grace Hopper in Section 8 (page 12) is a good start, but the paper would be even more impactful if it framed its optimization framework (Equation 1, page 6) as a general model for heterogeneous systems, where AMX is just one instance of a powerful host processor.
-
Static Memory Tiering Policy: The proposed CXL policy—storing model parameters on CXL and intermediate values (like the KV cache) on DDR—is pragmatic and effective. However, it is a static policy. This is a missed opportunity to discuss the potential for more dynamic, access-pattern-aware data placement, which is a rich area for future research that this work directly enables.
Questions to Address In Rebuttal
-
Could the authors comment on the sensitivity of LIA's performance to the CPU's core count? The experiments use high-end 40-core and 128-core CPUs. How would the optimal offloading policy and resulting performance change on a lower-end, say 16-core, AMX-enabled server CPU? This is critical for understanding the true cost-efficiency of this approach.
-
The optimization framework in Section 5.1 is based on performance parameters of the specific CPU and GPU. How portable is this framework conceptually? If one were to build a similar system using an AMD CPU with its own on-chip AI engine, would the framework apply directly with only new performance measurements, or would it require fundamental changes?
-
The paper makes a strong case for cost-efficiency against multi-GPU setups. Could you provide a brief TCO (Total Cost of Ownership) sketch? A modern H100 GPU and a Granite Rapids CPU still represent a very expensive server. How does the estimated cost per token served compare to, for example, a server with two last-generation A100 GPUs, which might have a similar acquisition cost?
Review 3
Review Form: The Innovator (Novelty Specialist)
Summary
This paper introduces LIA, a framework for accelerating single-GPU LLM inference for models that exceed GPU memory. The authors' contribution is composed of three primary elements. First, they provide a timely performance characterization of Intel's Advanced Matrix Extensions (AMX) on recent CPUs, establishing that modern CPUs are computationally powerful enough to be viable partners in LLM inference, not just data hosts. Second, leveraging this insight, they propose a novel, systematic compute-offloading algorithm. This algorithm formulates the decision of which Transformer sub-layers to execute on the CPU versus the GPU as an optimization problem, creating a dynamic policy based on batch size and sequence length. This contrasts with prior art that used static, heuristic-based offloading. Third, the paper introduces a specific memory-offloading policy for systems with both DDR and CXL memory, proposing to store model parameters on CXL while keeping latency-sensitive intermediate data (like the KV cache) on DDR to mitigate the performance penalty of CXL on CPU-side computation.
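To ground the discussion that follows, here is a minimal, hypothetical sketch of the kind of cost-model-driven sublayer placement described above. The class, parameter names, and numbers are illustrative assumptions; this is not LIA's Equation 1, which solves a joint optimization rather than a greedy per-sublayer choice.

```python
from dataclasses import dataclass

@dataclass
class Sublayer:
    name: str
    flops: float         # arithmetic work for this sublayer at a given (B, L)
    bytes_to_gpu: float  # parameter/KV-cache bytes that must cross PCIe if run on the GPU

# Hypothetical system parameters (illustrative only).
CPU_FLOPS = 8e12     # AMX-class CPU matmul throughput
GPU_FLOPS = 150e12   # GPU tensor-core throughput
PCIE_BW = 25e9       # bytes/s available for streaming data to the GPU

def place(sublayers):
    """Greedy per-sublayer placement: run each sublayer wherever its estimated
    latency (compute time plus any PCIe transfer) is lower."""
    policy = {}
    for s in sublayers:
        cpu_time = s.flops / CPU_FLOPS
        gpu_time = s.flops / GPU_FLOPS + s.bytes_to_gpu / PCIE_BW
        policy[s.name] = "CPU" if cpu_time < gpu_time else "GPU"
    return policy

# Low-arithmetic-intensity attention scoring (with its KV cache resident in host
# memory) tends to stay on the CPU, while compute-dense projections favor the GPU
# even after paying to stream their weights.
print(place([Sublayer("attn_score", flops=2e9, bytes_to_gpu=2e9),
             Sublayer("ffn_proj", flops=5e11, bytes_to_gpu=4e8)]))
```

The reviews' shared observation is that a model of this kind makes the placement a function of B, L, and hardware parameters rather than a fixed structural rule.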
Strengths
The primary strength of this paper lies in its novel reframing of the CPU's role in CPU-GPU cooperative inference. My evaluation identifies the following genuinely new contributions:
-
Shift from Static to Dynamic Offloading Policy: The central novel idea is the move away from the rigid offloading policies seen in prior work. Frameworks like FlexGen [43] and FastDecode [23] identified a single, structurally-determined "best" candidate for offloading (the attention scoring sublayer) based on a general principle (low arithmetic intensity). LIA's contribution is the formulation of this choice as a formal optimization problem (Equation 1, page 6). The policy vector p, combined with a cost model that considers system parameters, B, and L to find an optimal configuration, is, to my knowledge, a new and more sophisticated approach for this problem domain. It correctly intuits that the "best" sublayer to offload is not fixed, but is a function of the workload.
-
A Motivated, Non-Trivial CXL Memory Policy: The use of CXL for memory expansion is not in itself novel. However, the proposed policy is not a naive extension. The novelty lies in the insight derived from "Observation-2" (page 8), which correctly identifies that the CPU's own compute performance is severely hampered by CXL's higher latency. The resulting policy—store parameters on CXL (primarily for GPU transfer via DMA, which is less latency-sensitive) and keep CPU-accessed data (KV cache) on DDR—is a simple, elegant, and novel solution tailored specifically to the performance characteristics of their cooperative computing model. This is a clear advancement over simply treating CXL as a generic memory pool.
-
Foundational Insight on Modern CPU Capability: While performance studies are common, the specific analysis in Section 4 (page 4) serves as the novel insight that motivates the entire work. Prior art has operated under the assumption that CPUs are orders of magnitude slower than GPUs, thus limiting offloading to only the most trivial compute tasks. By demonstrating that AMX-enabled CPUs achieve throughput that is competitive with previous-generation GPUs (e.g., SPR-AMX vs. P100/V100 for certain workloads in Figure 5, page 5), the authors provide the foundational evidence needed to justify their more ambitious and dynamic offloading algorithm. This re-evaluation of the hardware baseline is a key part of the paper's novelty.
Weaknesses
While the core ideas are novel in their application, their conceptual origins are not without precedent in adjacent fields. The paper's claims could be strengthened by acknowledging this and positioning the work more precisely.
-
Framing of the Optimization Model: The concept of using a cost model to partition a workload across heterogeneous processors is a classic problem in systems research. The authors present their algorithm in Section 5.1 (page 6) as a core contribution, which it is. However, the novelty is not in the invention of cost-based scheduling, but in its specific formulation for the unique data flow and computational characteristics of Transformer sub-layers. The paper would be more intellectually honest if it contextualized its model within the broader history of heterogeneous scheduling to more sharply define its specific contribution.
-
Simplicity of the CXL Policy: The CXL policy, while effective and well-motivated, is ultimately a binary, static decision. It represents a single, clever data placement rule rather than a comprehensive memory management framework. This is not a fatal flaw, but it does limit the scope of the novelty. The contribution is a specific, useful heuristic, not a general theory for tiered memory management in LLM inference.
Questions to Address In Rebuttal
-
On the Novelty of the Cost Model: Please explicitly differentiate the novelty of your formulation in Section 5.1 (page 6) from the general, well-established field of cost-based task scheduling on heterogeneous systems. What specific characteristics of the LLM inference problem make your model a non-trivial application of these classic ideas?
-
On the Generalizability of the Offloading Policy: The optimization is performed over a fixed set of six sub-layers derived from the OPT architecture. The novelty of the framework appears tied to this specific structure. How does your method generalize to architectures with fundamentally different layer structures, such as Mixture-of-Experts (MoE) models where the feed-forward network is replaced by a sparse routing mechanism? Is the novel contribution the specific model for OPT, or a more general meta-framework for constructing such models?
-
On the Limits of the CXL Policy: The proposed CXL policy is a static partitioning of data types (parameters vs. KV cache). Did the authors consider or evaluate more dynamic policies? For example, in a scenario with an extremely long context length, the KV cache itself could become enormous. Would a tiered policy, where older KV cache tokens are migrated from DDR to CXL, be feasible? What are the conceptual boundaries of your proposed policy's novelty?
Enabling Ahead Prediction with Practical Energy Constraints
Abstract
Accurate branch predictors require multiple cycles to produce a prediction, and that latency hurts processor performance. "Ahead prediction" solves the performance problem by starting the prediction early. Unfortunately, this means making the prediction ...
Reviews
Review 1
Paper Title: Enabling Ahead Prediction with Practical Energy Constraints
Reviewer Persona: The Guardian (Adversarial Skeptic)
Summary
The paper addresses the well-known performance bottleneck of multi-cycle branch predictors by proposing an energy-efficient "ahead prediction" mechanism. The authors' core thesis is that the number of "missing history" patterns between the start of a prediction and its required use is far smaller than the theoretical maximum of 2^N. They leverage this insight to modify a TAGE predictor, adding a secondary tag to disambiguate between these few likely patterns. This avoids the exponential energy cost of prior-art ahead predictors which speculatively generate predictions for all 2^N paths. The authors claim their design achieves a 4.4% IPC improvement with only a 1.5x increase in per-prediction energy, which they argue is "very much viable" in contrast to the 14.6x overhead of previous proposals.
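To fix ideas before the detailed critique, the toy sketch below illustrates the tag-disambiguation step: each TAGE-like table is read once using the ahead PC/history, and an entry's secondary tag determines which missing-history pattern it serves, so predictions for every possible pattern can be selected without additional table reads. The function names and the identity hash are illustrative; the paper derives its secondary tag from intermediate branch targets, and no pipeline timing is modeled here.

```python
def select_predictions(table_entries, n_missing_bits, hash_missing):
    """table_entries: one (secondary_tag, prediction) pair per TAGE table that hit,
    ordered longest history first. Returns, for every possible missing-history
    pattern, the prediction of the longest-history matching entry (or None)."""
    selected = {}
    for pattern in range(2 ** n_missing_bits):
        selected[pattern] = next(
            (pred for sec_tag, pred in table_entries
             if sec_tag == hash_missing(pattern)),
            None,  # no entry trained for this pattern: fall back to the base predictor
        )
    return selected

# Example: two tables hit for this ahead PC/history, trained on two different
# 5-bit missing-history patterns; the remaining 30 patterns fall through.
hits = [(0b10110, True), (0b00011, False)]
print(select_predictions(hits, n_missing_bits=5, hash_missing=lambda p: p))
```

The energy argument the review weighs (1.5x versus 14.6x) rests on the fact that only this selection step is replicated per pattern, not the table reads themselves.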
Strengths
-
Strong Core Insight: The paper's primary contribution rests on the analysis presented in Section 3 (page 4), particularly Figure 2. The empirical demonstration that the number of observed missing history patterns is typically very low (1-2) for a given ahead PC and history is a valuable and well-articulated insight. This observation correctly identifies the primary source of inefficiency in prior ahead prediction schemes.
-
Well-Motivated Problem: The authors effectively motivate the problem with a clear analysis of the performance impact of predictor latency (Figure 1, page 1) and a quantitative argument against the energy scaling of existing ahead prediction solutions. The problem is both timely and significant for high-performance processor design.
Weaknesses
My primary concerns with this paper lie in the overstatement of its practical viability, weaknesses in the evaluation methodology, and an insufficient analysis of the mechanism's potential failure modes.
-
Insufficient Energy Analysis and Overstated Viability: The central claim of practicality hinges on the 1.5x energy figure. This figure, as described in Section 4.5 (page 7), is derived from Cacti simulations focused primarily on table reads. The analysis appears to downplay the energy cost of the duplicated selection logic. For a 5-bit secondary tag, this implies generating 32 predictions in parallel, requiring 32 sets of comparison and selection logic. While the authors claim this is "significantly less than the table reads," this assertion is not substantiated with synthesis data or a more detailed model. The term "very much viable" (Abstract, page 1) is premature without a more rigorous power analysis. A 1.5x increase in the energy of a component that already consumes 3-4% of total core power is non-trivial and requires stronger justification.
-
Unconvincing Explanation for Performance Degradations: The results in Figure 12 (page 10) show that several benchmarks, notably gcc, omnetpp, and xalancbmk, experience performance degradation or negligible improvement despite, in some cases, a reduction in MPKI. The authors attribute this to the loss of "beneficial wrong-path prefetching." This is a common but often unsubstantiated explanation for performance anomalies related to branch prediction. The paper provides no direct evidence (e.g., L1/L2 cache miss rates, memory bandwidth analysis) to support this claim. Without such evidence, this explanation is speculative and fails to address whether another unaccounted-for microarchitectural interaction or timing artifact is responsible for the performance loss, thus questioning the general applicability of the proposed technique.
-
Evaluation on an Overly Aggressive Baseline: The evaluation is performed on a 16-wide out-of-order core (Table 2, page 9). Such a wide machine is exceptionally sensitive to front-end stalls, which will disproportionately inflate the benefits of any mechanism that improves front-end throughput. The 4.4% IPC gain reported is likely an optimistic upper bound. The paper lacks a sensitivity study to show how this benefit scales on more conventional 8-wide or 6-wide cores, where back-end pressures might diminish the relative importance of predictor latency.
-
Potential Underestimation of Runtime Conflict Issues: The design's efficacy relies on the offline analysis (Section 3) that few missing history patterns exist. However, this static view may not capture dynamic program behavior. During program phase transitions, a burst of previously unseen patterns could emerge, leading to a high degree of conflict on the secondary tag. This would cause entries to be promoted to higher-latency TAGE tables unnecessarily, as noted in Section 4.4 (page 6), effectively polluting the predictor. The analysis in Table 1 shows only a small delta in misprediction rate, which is difficult to interpret without the absolute baseline rates and an analysis of worst-case behavior.
Questions to Address In Rebuttal
-
Can the authors provide a more detailed breakdown of the energy model from Section 4.5? Specifically, what is the estimated energy contribution of the duplicated selection logic (for 32 parallel predictions) relative to the table read energy, and how was this estimated? How robust is the 1.5x claim to a more complete energy model?
-
Regarding the performance losses on gcc, omnetpp, and xalancbmk (Figure 12), please provide quantitative evidence to support the "beneficial wrong-path prefetching" hypothesis. For instance, can you show that the baseline configuration exhibits a lower L2 miss rate or reduced memory latency for these benchmarks compared to your proposed design?
-
How do the reported IPC improvements change on a less aggressive core, for instance, an 8-wide machine with proportionally scaled execution resources? This is critical for understanding the technique's relevance beyond hero-level processor designs.
-
The argument that conflicts are minimized (Section 4.4) is based on average-case behavior. Have you analyzed the predictor's behavior during pathological cases, such as program phase changes, where the number of active missing history patterns might spike? How quickly does the predictor adapt, and what is the transient performance impact?
-
Your baseline is a 3-cycle TAGE. The state-of-the-art TAGE-SC-L predictor includes Statistical Corrector (SC) and Loop (L) components. While you argue pipelining them is expensive (Section 5.5), a complete evaluation should compare your ahead-pipelined TAGE against a non-ahead, 3-cycle TAGE-SC-L baseline. What is the performance of TAGE-SC-L relative to TAGE in your framework, and how does your 4.4% gain compare to that?
Review 2
Reviewer Persona: The Synthesizer (Contextual Analyst)
Summary
This paper addresses the well-known and critical problem of multi-cycle branch predictor latency in high-performance processors. While the industry standard multi-level prediction scheme mitigates this, it suffers performance stalls when the simple and overriding predictors disagree. "Ahead prediction," which starts a prediction for a future branch early, is a known alternative but has been considered impractical due to the exponential growth in energy required to speculatively evaluate all 2^N possible paths for N skipped intermediate branches.
The authors' core contribution is an empirical observation that fundamentally challenges this assumption of exponential complexity. They demonstrate that for a given control flow history, the number of unique "missing history" patterns that actually manifest at runtime is extremely small—often just one or two, rather than the theoretical maximum.
Based on this insight, they propose a practical ahead predictor design. By modifying the TAGE predictor to include a "secondary tag" corresponding to the specific missing history pattern, they can distribute the predictions for the few observed paths across the existing TAGE tables. This approach avoids the need to read out an exponential number of entries, scaling the per-prediction energy linearly (a 1.5x increase) instead of exponentially (a 14.6x increase in prior proposals). The result is a design that achieves a significant 4.4% IPC improvement, capturing much of the ideal performance benefit of a zero-latency predictor while remaining energy-viable.
Strengths
-
Reframing a Foundational Problem: The paper's primary strength lies in its elegant reframing of the ahead prediction challenge. Instead of accepting the theoretical 2^N complexity as a given, the authors question its practical relevance. The analysis in Section 3 (particularly Figure 2, page 3) is the heart of the paper, providing a strong empirical foundation for their entire approach. This is an excellent example of how data-driven insights can unlock progress on long-standing architectural problems.
-
Elegant and Practical Mechanism: The proposed 2-tag TAGE modification is a clever, non-invasive solution. It intelligently leverages the existing structure and allocation logic of a state-of-the-art predictor (TAGE) to handle the new requirement of differentiating missing history paths. This integration makes the idea feel less like a complex new unit and more like a natural evolution of existing hardware, which significantly increases its perceived feasibility.
-
Bridges the Gap between Theory and Practice: The most compelling aspect of this work is that it takes a powerful academic concept (ahead prediction) that has been largely relegated to the "theoretically interesting but practically infeasible" category and makes it demonstrably practical. The energy analysis presented in Section 4.5 (Figure 7, page 7) is crucial; reducing the energy overhead from a prohibitive 14.6x to a highly palatable 1.5x for a 4.4% performance gain is a fantastic trade-off and the paper's headline result.
-
An Enabling Technology: Beyond its direct IPC contribution, this work is significant as an enabling technology. By effectively "solving" the predictor latency problem, it relieves a major constraint on front-end design. Architects could be empowered to design even larger and more accurate (and thus higher latency) main predictors, such as the very large ML-based predictors that are emerging, without paying the traditional performance penalty. It also directly boosts the effectiveness of other front-end techniques that rely on running ahead of the execution core, such as advanced instruction prefetching (FDIP, APF) and runahead execution.
Weaknesses
-
Tight Coupling to TAGE: The proposed mechanism is deeply intertwined with the specifics of the TAGE predictor (its multiple tables, tag-based lookup, and promotion-on-conflict allocation scheme). While TAGE is a dominant design, this tight coupling leaves the generality of the approach in question. It is not immediately clear how the core insight of "tagging by missing history pattern" would be applied to fundamentally different predictor architectures, like perceptron-based or pure ML-based predictors.
-
Complexity of Interaction and Edge Cases: The design requires several supporting mechanisms to function, such as the single-cycle override (Section 5.1, page 8) to handle simple, short-history branches that are disadvantaged by the ahead scheme. While effective, this feels like a patch that adds complexity and hints that the core ahead mechanism isn't universally superior. The discussion of late predictions and pipeline restarts also suggests a non-trivial amount of control logic is required to manage the system.
-
Limited Exploration of the Branch Target Problem: The paper focuses almost exclusively on solving the latency of the directional predictor. However, a complete ahead prediction system must also predict branch targets ahead of time. The baseline architecture described (Section 5, page 7) explicitly keeps the BTB non-ahead-pipelined. An indirect branch within the N-branch ahead window presents a significant challenge, as its target is unknown when the ahead prediction is initiated. The paper does not fully explore the performance implications of this, which could be a major limiter in code with frequent indirect jumps or calls.
Questions to Address In Rebuttal
-
Generalizability: Could the authors elaborate on how their core insight—that only a few missing history paths are realized—could be applied to other classes of predictors? For instance, in a perceptron predictor, would this translate to training a small number of specialized weight tables for different missing history patterns?
-
Necessity of the Single-Cycle Override: The single-cycle override (Section 5.1) seems critical for recovering 1% of performance. Does this imply a fundamental weakness in ahead prediction for certain types of branches (e.g., those with very short, simple history patterns)? Could the authors comment on the characteristics of the branches that benefit from this override?
-
Interaction with Ahead Target Prediction: The paper focuses on directional prediction latency. How does the proposed scheme contend with indirect branches within the ahead window (e.g., the 5 branches being skipped)? Since the BTB is not ahead-pipelined, an indirect branch would seem to stall the entire process until its target is resolved, potentially negating much of the latency-hiding benefit in certain workloads. Could you clarify the expected performance impact of this limitation?
Review 3
Review Form: The Innovator (Novelty Specialist)
Summary
This paper addresses the well-known problem of prediction latency in modern, high-accuracy branch predictors. The established solution, "ahead prediction," suffers from a critical flaw: an exponential increase in energy and hardware cost as it must speculatively generate predictions for all 2^N possible paths, where N is the number of skipped branches.
The authors' contribution is twofold. First, they present a crucial, empirically-backed insight: the number of actually observed future paths (or "missing history patterns") for a given ahead history is far smaller than the theoretical maximum of 2^N, often just one or two. Second, they propose a novel microarchitectural mechanism to exploit this insight. Instead of reading 2^N predictions, they modify the TAGE predictor by adding a "secondary tag" to each entry. This tag explicitly encodes the specific missing history pattern an entry is trained for. This allows the predictor to support multiple future paths while only performing a single read per TAGE table, fundamentally changing the energy scaling from exponential to linear. The result is a design that makes ahead prediction practical for hiding multi-cycle latencies.
Strengths
-
Novel Core Insight and Mechanism: The central contribution of this paper is genuinely novel. While "ahead prediction" is a known concept [38, 41] and its exponential cost is its primary known inhibitor, no prior work has proposed a practical mechanism to break this scaling. The combination of the experimental insight (Section 3, Figure 2, page 3) that few paths materialize and the specific architectural solution (the secondary tag, Section 4, page 5) is, to my knowledge, new.
-
Clear Differentiation from Prior Art: The authors correctly identify the seminal work by Seznec [38] as the closest prior art for multi-path ahead prediction. They clearly articulate the "delta": Seznec's approach relies on reading out consecutive entries (a brute-force spatial solution), leading to exponential energy scaling. This paper's secondary tag mechanism is a fundamentally different, more efficient encoding scheme that multiplexes predictions for different future paths within the standard TAGE structure. This change directly attacks the primary weakness of the prior art.
-
Data-Driven Justification: The novelty is not based on a mere assumption. The analysis in Section 3 ("Number of Missing History Patterns") provides a strong empirical foundation for the entire design. Showing that for 64-bit histories, over 98% of cases see four or fewer patterns is a powerful result that justifies the feasibility of their tagged approach.
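The kind of trace analysis behind this justification can be sketched as follows; it is an illustrative approximation of the Section 3 measurement, and the exact keying of ahead PC, history length, and skipped-branch window in the paper may differ.

```python
from collections import defaultdict

def count_missing_patterns(trace, n_ahead=5, hist_bits=64):
    """trace: list of (pc, taken) branch outcomes in program order.
    For each (ahead PC, ahead history) key, record which N-branch continuations
    ("missing history" patterns) actually occur at runtime."""
    patterns = defaultdict(set)
    hist = 0
    for i, (pc, taken) in enumerate(trace):
        if i + n_ahead < len(trace):
            future = 0
            for _, t in trace[i + 1 : i + 1 + n_ahead]:
                future = (future << 1) | int(t)
            patterns[(pc, hist)].add(future)
        hist = ((hist << 1) | int(taken)) & ((1 << hist_bits) - 1)
    return patterns

# Fraction of keys that ever see more than four distinct continuations:
# sum(len(v) > 4 for v in patterns.values()) / max(len(patterns), 1)
```

If, as reported, that fraction stays below 2% for 64-bit histories, a small secondary tag covers nearly all cases, which is precisely the empirical foundation being credited here.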
Weaknesses
-
Incremental Nature of the Architecture: While the core idea of the secondary tag is novel in this application, the overall architecture is an extension of the existing, highly-complex TAGE predictor. The novelty lies in a clever modification, not a fundamentally new prediction algorithm. The work is therefore an evolutionary step, albeit a very important one for making a known technique practical.
-
Potential for Conceptual Overlap in Other Domains: The core concept is "tagging a prediction with the intermediate state required to validate it." While this appears new for branch predictor ahead history, similar concepts may exist in other areas of speculative execution. For instance, certain forms of data prefetching or value prediction might speculatively generate a result and tag it with the dependency information required for its use. The authors should be very careful in positioning this as the absolute first use of such a disambiguation technique in microarchitecture.
Questions to Address In Rebuttal
-
On the Novelty of the Implementation: The novelty lies in adding a secondary tag to the main TAGE tables. Have the authors considered alternative implementations of their core insight? For instance, instead of modifying the large TAGE tables, could one use a small, separate "Missing History Pattern Cache" indexed by the ahead PC/history that stores only the few (e.g., 1-4) observed secondary tags and their corresponding predictions? This might avoid polluting the main TAGE structure and could be a different, potentially novel, design point. Please justify the decision to integrate this mechanism directly into TAGE.
-
On the Robustness of the Hash Function: The secondary tag is generated by a hash of intermediate branch targets (Figure 6, page 6). This is a key component of the novel mechanism. How does this design handle pathological cases where different sequences of taken/not-taken branches resolve to the same set of intermediate targets, potentially causing aliasing in the secondary tag hash? Is the target-based hash strictly superior to a simpler hash of the directional bits of the missing history? (A toy aliasing check is sketched after this question list.)
-
Final Check on Prior Art: The idea of disambiguating multiple future paths stemming from a single speculative starting point is fundamental. Can the authors state with confidence that no prior work in related fields (e.g., trace caches, block-based predictors, or even speculative multithreading) has used a tagging mechanism to associate a speculative prediction with a specific, yet-to-be-resolved intermediate execution path? A more thorough search for conceptual analogues would strengthen the novelty claim.
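The aliasing concern raised in the hash-robustness question above can be made concrete with a toy check; the tag width, hash constants, and path statistics below are invented for illustration and are not taken from the paper.

```python
import random

TAG_BITS = 6  # assumed secondary-tag width; not from the paper

def tag_from_targets(targets):
    """Fold the intermediate branch targets of a skipped path into a short tag (assumed scheme)."""
    h = 0
    for t in targets:
        h = (h * 0x9E3779B1 + t) & ((1 << TAG_BITS) - 1)
    return h

def tag_from_dirs(dirs):
    """Alternative: encode the N taken/not-taken bits directly; alias-free whenever N <= TAG_BITS."""
    h = 0
    for d in dirs:
        h = (h << 1) | d
    return h

random.seed(1)
N = 3
paths = [(tuple(random.randint(0, 1) for _ in range(N)),
          tuple(random.randrange(1 << 20) for _ in range(N))) for _ in range(200)]

# Pairs of paths with *different* direction sequences but identical tags:
alias_targets = sum(1 for i in range(len(paths)) for j in range(i + 1, len(paths))
                    if paths[i][0] != paths[j][0]
                    and tag_from_targets(paths[i][1]) == tag_from_targets(paths[j][1]))
alias_dirs = sum(1 for i in range(len(paths)) for j in range(i + 1, len(paths))
                 if paths[i][0] != paths[j][0]
                 and tag_from_dirs(paths[i][0]) == tag_from_dirs(paths[j][0]))
print("aliasing pairs under the target-based tag:", alias_targets)
print("aliasing pairs under the direction-based tag:", alias_dirs)  # 0 while N <= TAG_BITS
```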
Leveraging control-flow similarity to reduce branch predictor cold effects in microservices
Abstract
Modern datacenter applications commonly adopt a microservice software architecture, where an application is decomposed into smaller interconnected microservices communicating via the network. These microservices often operate under strict latency ...
Reviews
Review 1
Reviewer: The Guardian
Summary
This paper identifies branch predictor cold-start effects as a significant performance bottleneck for latency-sensitive microservices. The authors propose a new hybrid prediction architecture, Similarity-based Branch Prediction (SBP), which leverages the observed high control-flow similarity (CFS) between different requests to the same microservice. The core idea is to use a pre-recorded "reference trace" of a past execution to predict the control flow of a new execution. An instantiation, CHESS, is presented, which optimizes this approach by filtering the reference trace to only include "hard-to-predict" branches, relying on a conventional predictor and static hints for the rest. The authors claim substantial MPKI reductions (up to 94%) and performance within 95% of a fully warm system, at a modest hardware cost of 18.1KB.
While the problem is well-defined and the proposed direction is intriguing, the work rests on a fundamentally fragile assumption of a single, static reference trace. Furthermore, the complex trace reduction heuristics introduce critical blind spots in execution tracking that make the proposed reconvergence mechanism appear unreliable under realistic conditions. The evaluation, while showing impressive numbers, fails to adequately stress-test these core assumptions, calling the robustness and generality of the solution into question.
Strengths
- Problem Motivation: The paper does an excellent job motivating the problem. The analysis in Section 2, particularly the Top-Down analysis in Figure 1, clearly establishes that branch predictor cold effects ("Frontend Bound Branch Resteers") are a major contributor to performance degradation in microservices.
- CFS Analysis: The study in Section 3, which quantifies the control-flow similarity across requests, provides a solid empirical foundation for the work. The methodology for identifying reconvergence points and measuring coverage/accuracy is sound and the findings (high similarity) justify the exploration of a similarity-based predictor.
Weaknesses
-
Fundamental Fragility of the Single Reference Trace: The entire SBP/CHESS architecture is predicated on the existence of a single reference trace that is representative of future executions. This is an extremely brittle design choice. The authors state they select the trace that "maximizes coverage" across a training set (Section 6), but this offers no guarantee of robustness against workload drift or even minor variations in request inputs that trigger different control paths. The paper provides no sensitivity analysis whatsoever on the choice of this reference trace. What is the performance if a less-than-optimal trace is chosen? What happens if the workload behavior shifts slightly after the offline profiling period? The current design appears to be a form of extreme overfitting to a training workload, which is untenable for production datacenter environments.
-
Unreliable Reconvergence Mechanism Due to Trace Reduction: The CHESS optimization, which removes "easy-to-predict" (EP) branches to shorten the trace, introduces a critical flaw. By removing these branches, the predictor loses visibility into the true execution path. The system's state (convergent or divergent) and its understanding of where it is in the program flow now depend on the underlying conventional predictor correctly handling these EP branches.
Consider the scenario described in Section 5.2: an HP branch is followed by a removed EP branch. The system assumes that after the HP branch, control will eventually arrive at the next HP or rEP branch in the trace. But what if the conventional predictor mispredicts the removed EP branch? The actual execution has now diverged, but the CHESS mechanism is completely unaware of this fact. It will continue to believe it is in a convergent state until it encounters the next HP branch, at which point its prediction will likely be wrong, and its calculated reconvergence point (based on the original, incorrect path) will be invalid. This creates a cascading failure scenario that is not analyzed or even acknowledged. The small-scale example in Figure 7 is insufficient to demonstrate that this logic is sound for complex, real-world control flow. (A toy sketch of this failure mode follows this weakness list.)
-
Benchmark-Specific Hardware Costing: The claimed storage cost of 18.1KB is not a general result but an artifact of the specific benchmarks evaluated. The authors derive the sizes for the trace buffer and associated tables by analyzing the characteristics of the eight microservices where CHESS showed a benefit (Section 7, "Storage Requirements"). This is a post-hoc justification, not a principled derivation of hardware requirements. The paper fails to provide any analysis on the worst-case trace length or how storage costs would scale for applications with more complex control flow, rendering the 18.1KB figure potentially misleading.
-
Insufficient Evaluation of System Dynamics: The trace loading mechanism is glossed over. The authors suggest a bulk load via a "privileged hardware control interface" at the start of a request (Section 6). They claim a minimal overhead of 0.4%-1.1% (Section 7), but this appears to be a simple calculation based on trace size and memory bandwidth, ignoring real-world system complexities. It does not account for OS scheduling latency, potential contention on the memory bus from other active cores, or the cost of the privileged operation itself. This is a critical system-level interaction that has been abstracted away to the point of being unrealistic.
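To illustrate the blind-spot scenario described in the second weakness above, here is a toy replay model; the control flow, branch names, and outcomes are invented and deliberately minimal, not taken from the paper's Figure 7.

```python
# Toy illustration: the reference trace records outcomes only for hard-to-predict (HP)
# branches; removed easy-to-predict (EP) branches are invisible to the replay mechanism.

reference_trace = [("hp1", 1), ("hp2", 0)]        # outcomes recorded on the reference path

def replay(actual_path):
    """Follow the HP-only trace over an actual execution; count HP prediction hits."""
    ptr, state, correct, total = 0, "convergent", 0, 0
    for branch, actual in actual_path:
        if branch.startswith("ep"):
            continue  # EP branch: the trace mechanism never observes it
        total += 1
        if state == "convergent" and ptr < len(reference_trace):
            _, pred = reference_trace[ptr]
            ptr += 1
            if pred == actual:
                correct += 1
            else:
                state = "divergent"  # divergence is detected only at the next HP branch
    return correct, total

# Request whose removed EP branch resolves as it did on the reference path:
print(replay([("hp1", 1), ("ep1", 1), ("hp2", 0)]))   # (2, 2)
# Request whose removed EP branch resolves differently from the reference path; the
# replay still believes it is convergent and mispredicts the following HP branch:
print(replay([("hp1", 1), ("ep1", 0), ("hp2", 1)]))   # (1, 2)
```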
Questions to Address In Rebuttal
-
Please provide a sensitivity analysis regarding the choice of the reference trace. What is the performance degradation if the reference trace is chosen to be the one with the median or lowest coverage from the training set, instead of the maximum? How does performance degrade as the test workload increasingly diverges from the training workload used for trace selection?
-
The trace reduction in CHESS creates blind spots. Please provide a detailed analysis of the case where a conventional predictor mispredicts a removed "easy-to-predict" (EP) branch. How does the CHESS mechanism detect and recover from this, given that its own state is now inconsistent with the true execution path? Quantify how often this scenario occurs in your experiments and its impact on the subsequent accuracy of the similarity predictor.
-
The 18.1KB storage cost is derived from the evaluated benchmarks. What is the maximum observed HP+rEP trace length across all requests in your dataset, not just the chosen reference trace? Can you provide a more robust model for how trace length and storage cost scale with program complexity (e.g., number of basic blocks, cyclomatic complexity) rather than a value tuned to a small set of benchmarks?
-
The trace loading overhead of 1.1% seems optimistic. Can you provide a more detailed breakdown of this calculation? Does it include the software overhead of trapping to the OS, the OS scheduling the load, and potential queuing delays in the memory controller under a heavily loaded system, or is it purely a theoretical data transfer time?
Review 2
Reviewer Persona: The Synthesizer (Contextual Analyst)
Summary
This paper addresses the significant performance penalty caused by branch predictor cold-starts in short-lived microservice executions. The authors identify that the core reason this penalty is so high is that the control flow across different requests for the same microservice is highly similar. The central contribution is a new hybrid branch predictor architecture, Similarity-based Branch Prediction (SBP), which leverages this observation. SBP uses an offline-generated "reference trace" of a typical request's control flow to make highly accurate predictions at runtime. The paper proposes a specific, pragmatic implementation called CHESS, which judiciously applies this similarity-based prediction only to branches identified as "hard-to-predict" by conventional mechanisms, thereby keeping the reference trace manageably small. The evaluation demonstrates that CHESS can reduce branch mispredictions (MPKI) by up to 94% compared to a cold predictor, achieving performance that is within 95% of an ideal warm-predictor baseline with a modest hardware storage cost.
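As a concrete illustration of the filtering idea summarized above, here is a minimal sketch of an offline pass that separates hard-to-predict branches from the rest using a simple bimodal proxy; the synthetic trace, the 5% threshold, and the omission of the retained-EP (rEP) branches are all simplifications of mine, not the paper's actual methodology.

```python
import random
from collections import defaultdict

# Toy offline trace-reduction pass over a synthetic branch trace.
random.seed(0)
full_trace = []
for _ in range(100_000):
    pc = random.randrange(200)            # 200 static branches, uniformly exercised
    biased = pc < 180                     # most static branches are heavily biased
    outcome = 1 if biased else random.randint(0, 1)
    full_trace.append((pc, outcome))

counter = defaultdict(lambda: 1)          # 2-bit bimodal counters, initialized weakly not-taken
wrong, seen = defaultdict(int), defaultdict(int)
for pc, outcome in full_trace:
    predict_taken = counter[pc] >= 2
    wrong[pc] += int(predict_taken != bool(outcome))
    seen[pc] += 1
    counter[pc] = min(3, counter[pc] + 1) if outcome else max(0, counter[pc] - 1)

hard_to_predict = {pc for pc in seen if wrong[pc] / seen[pc] > 0.05}   # assumed HP threshold
reduced_trace = [(pc, o) for pc, o in full_trace if pc in hard_to_predict]
print(f"{len(full_trace)} dynamic branches -> {len(reduced_trace)} after HP filtering "
      f"({len(hard_to_predict)} of 200 static branches retained)")
```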
Strengths
-
Excellent Problem Contextualization: The paper does a superb job of situating its contribution within a critical trend in modern computing: the shift towards latency-sensitive, decomposed microservice architectures. The initial analysis in Section 2.2 (page 2), using Top-Down analysis to pinpoint "Branch Resteers" as a major source of cold-start overhead, provides a compelling and quantitative motivation for the work. This is not a solution in search of a problem; it is a direct and well-reasoned attack on a known, significant bottleneck.
-
Elegant Synthesis of Existing Concepts: The core strength of this work lies in its synthesis of several established lines of research. It connects the observation of control-flow similarity (CFS), previously explored in works on web servers [11, 37], with the well-known "cold start" problem in branch prediction. The proposed solution is a logical and powerful next step from prior work like Ignite [54], which warmed up the BTB and a simple bimodal predictor. CHESS extends this "record and replay" philosophy to the far more complex state of a modern TAGE-style conditional predictor. Furthermore, the use of offline profiling via Intel PT places it in conversation with Profile-Guided Optimization (PGO) techniques like Whisper [38] and Thermometer [58], but offers a distinct, dynamic architectural alternative to static binary modification.
-
Pragmatic and Thoughtful Design: The design of CHESS is not a brute-force solution. The insight to filter the reference trace to only include "hard-to-predict" (HP) and a few "retained easy-to-predict" (rEP) branches is the key to its practicality. This reduces the trace length by orders of magnitude (as shown in Figure 12, page 12) while preserving most of the performance benefit. The mechanism for handling divergence and reconvergence is also well-defined, making the hybrid approach robust. This demonstrates a deep understanding of the trade-offs between prediction accuracy and implementation cost.
-
Strong and Convincing Results: The empirical evaluation is thorough and the results are impressive. Achieving performance that closes such a large gap to a warm baseline is a significant achievement. The comparison against multiple baselines, including a warm predictor, a static-hint-based predictor, and an "unbounded" predictor, effectively isolates the benefits of the proposed technique and demonstrates that the problem is truly one of cold state, not predictor capacity.
Weaknesses
My critiques are less about flaws and more about opportunities to further contextualize the work and explore its boundaries.
-
Understated Hardware Complexity: While the storage cost of the reference trace is well-quantified (18.1KB), the logic complexity added to the processor front-end is discussed at a high level. The front-end is one of the most timing-critical parts of a modern core. The machinery required to manage the Trace Buffer, the read pointer, the CSD tracking, and the state transitions between convergent/divergent modes (Figure 8, page 9) could have non-trivial implications for pipeline latency and clock frequency. A more detailed discussion of these implementation trade-offs would strengthen the paper.
-
Sensitivity to Workload Dynamics: The model relies on a single reference trace being representative of the vast majority of requests. The authors reasonably argue that datacenter workloads evolve slowly, allowing for periodic regeneration of traces. However, this assumption could be brittle. A sudden shift in the workload mix or the introduction of a new, popular feature could lead to a significant divergence from the reference trace, potentially degrading performance until a new trace is deployed. An analysis of the system's sensitivity to such "concept drift" would be a valuable addition. How gracefully does performance degrade as the live traffic deviates from the reference trace?
-
Implicit Trade-offs with Static PGO: The paper compares CHESS's accuracy to Whisper [38] but could more explicitly articulate the architectural trade-off space. CHESS requires new, specialized hardware but offers runtime flexibility (a new trace can be loaded without recompilation). Static PGO techniques modify the binary to embed prediction logic, working on existing hardware but requiring a slower re-deployment cycle to adapt to workload changes. A discussion of which scenarios favor one approach over the other would help position the contribution more clearly for system architects.
Questions to Address In Rebuttal
-
Beyond the 18.1KB trace storage, could the authors comment on the anticipated logic overhead and potential timing impact of integrating the CHESS mechanisms into a high-frequency processor front-end? Is the two-cycle override delay mentioned in Section 7 a conservative estimate?
-
How robust is the single reference trace model? Have the authors considered a scenario where the live request distribution is bimodal, with two common but distinct control-flow paths? Would the system require multiple reference traces, and how would it select the correct one at runtime?
-
Could the authors elaborate on the architectural trade-offs between their dynamic, trace-based approach and static, PGO-based approaches like Whisper? Specifically, in what deployment environments (e.g., public cloud FaaS vs. private enterprise microservices) would the hardware cost and runtime flexibility of CHESS be most advantageous compared to a software-only PGO solution?
Review 3
Paper: Leveraging control-flow similarity to reduce branch predictor cold effects in microservices
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper proposes Similarity-based Branch Prediction (SBP), a hybrid branch prediction architecture designed to mitigate cold-start effects in microservices. The core mechanism involves recording a "reference execution trace" from a past request and using this trace at runtime to predict the control flow (branch directions and targets) for subsequent, similar requests. The system operates in a "convergent" mode, following the trace, and falls back to a conventional predictor upon a "divergence" (a mismatch between the trace's prediction and the actual execution path). The authors present a specific implementation, CHESS, which optimizes the reference trace by filtering out branches deemed easy-to-predict and applies this replay mechanism primarily to hard-to-predict branches. The stated goal is to achieve performance close to that of a warm predictor by effectively "replaying" the control flow of a canonical execution.
Strengths
The engineering and evaluation of the CHESS system are thorough. The optimizations proposed to reduce the reference trace size (Section 5.1), such as removing direct branches and easy-to-predict conditional branches while retaining key nodes for reconvergence (rEP), are practical and well-reasoned. The performance results are compelling, demonstrating a significant reduction in branch MPKI over a cold baseline.
Weaknesses
The central weakness of this paper, from a novelty standpoint, is that its core conceptual contribution is not new. The fundamental idea of recording a trace of a prior execution to warm up or guide front-end microarchitectural structures has been explored in prior work.
The authors' novel claim is the SBP architecture. However, this architecture is functionally and conceptually an extension of previously published ideas. The most proximate prior art includes:
-
Ignite [54]: This work, cited by the authors, explicitly "records and replays branch-target buffer (BTB) insertions to warm up a cold BTB together with a bimodal predictor." The mechanism is identical at a high level: record a trace of front-end events and replay it to pre-condition the hardware for a subsequent invocation. The primary delta in the present work is applying this concept to a more complex conditional predictor (TAGE-SC-L) rather than just the BTB and a simple bimodal predictor. While this is a valid engineering extension that requires solving new problems (e.g., managing more complex state), it does not represent a new paradigm. The core idea of "replay for branch prediction warmup" is already established by Ignite.
-
Lukewarm [53]: This work, also cited, "records and replays instruction accesses to warm up a cold instruction cache." Again, this establishes the precedent of using a recorded trace from a past serverless invocation to mitigate cold-start effects in a front-end structure (the I-cache). SBP applies the same pattern—record and replay—to a different, albeit related, structure.
-
Trace Caches [51]: The idea of following a pre-recorded path of execution is the foundational concept of a trace cache. While a trace cache stores decoded uops and focuses on fetch bandwidth, the underlying principle of identifying and replaying a common dynamic instruction sequence is the same. SBP can be viewed as a "prediction trace cache" that replays branch outcomes rather than instructions.
The offline analysis component of SBP/CHESS, where a representative trace is selected and optimized, is a well-established methodology from the domain of Profile-Guided Optimization (PGO). Works like Whisper [38] and Thermometer [58] already use extensive offline profiling to create highly specialized prediction hints or logic for difficult branches. SBP's contribution here is the specific format of the output (an explicit trace for replay) rather than the process itself.
Therefore, the paper's contribution is not the invention of a new prediction method, but rather the synthesis and extension of the known "record-and-replay" technique to a modern, complex branch predictor. The "delta" over prior art is one of scope and engineering sophistication, not one of fundamental concept.
Questions to Address In Rebuttal
-
The authors must clearly demarcate the conceptual novelty of SBP from Ignite [54]. Beyond applying the replay concept to a TAGE predictor instead of a BTB/bimodal predictor, what is the fundamental new idea that separates these two works? The paper currently positions itself as a new approach, but it appears to be a direct and logical extension of Ignite's core mechanism.
-
The paper's explicit handling of divergence and reconvergence (Section 4.3) using post-dominator analysis is robust. Is this mechanism itself claimed as novel, or is it the application of known compiler theory [12, 14] to the record-and-replay context? Prior replay-based schemes must have implicitly or explicitly handled divergence; please clarify how SBP's approach is fundamentally different and not just a more formal implementation.
-
Given the significant conceptual overlap with prior art, the contribution seems to lie in the engineering of the CHESS system, particularly the trace reduction heuristics (Section 5.2). Could the authors re-frame their contribution not as a novel prediction paradigm (SBP), but as a set of novel and practical optimizations (CHESS) that make the existing record-and-replay paradigm viable for complex, state-of-the-art predictors?
Cramming a Data Center into One Cabinet, a Co-Exploration of Computing and Hardware Architecture of Waferscale Chip
Abstract
The rapid advancements in large language models (LLMs) have significantly increased hardware demands. Wafer-scale chips, which integrate numerous compute units on an entire wafer, offer a high-density computing solution for data centers and can extend ...
Reviews
Review 1
Paper Title: Cramming a Data Center into One Cabinet, a Co-Exploration of Computing and Hardware Architecture of Waferscale Chip
Reviewer Persona: The Guardian (Adversarial Skeptic)
Summary
The authors present Titan, an automated co-design framework for wafer-scale systems, with the goal of optimizing performance and integration density within a single cabinet under cost and physical constraints. The framework proposes two primary architectural optimizations: the integration of configurable on-wafer memory dies and the enforcement of vertical area constraints across the hardware stack to improve density. Titan explores a hierarchical design space, from intra-chip core configuration to inter-chip cabinet layout, using a series of analytical models for performance, cost, and physical reliability to prune the space and identify optimal configurations. The paper claims that designs produced by Titan significantly outperform a state-of-the-art Dojo-like baseline.
While the ambition to create a holistic co-design framework for wafer-scale systems is commendable, this work rests on a foundation of overly simplified and critically unvalidated models. The staggering performance gains reported are not substantiated by rigorous evaluation, and the framework’s ability to produce physically realizable and truly optimal designs is highly questionable.
Strengths
- Problem Definition: The paper correctly identifies a critical and timely research problem. The co-design of computing and hardware architectures for wafer-scale systems, particularly in the context of LLMs, is of significant interest to the community.
- Holistic Scope: The authors should be credited for the comprehensive scope of their proposed framework. The hierarchical model, which attempts to connect low-level parameters like MAC unit counts to high-level decisions like cabinet chip layout (Section 3, Figure 6), represents a holistic view of the design problem.
- Ablation Study Structure: The use of distinct baselines (D-Arch, C-Arch, S-Arch) to isolate the benefits of the proposed architectural optimizations is a methodologically sound approach to structuring the evaluation (Section 5.2.3). This allows for a clear, albeit theoretical, attribution of the claimed performance gains.
Weaknesses
My primary concerns with this paper are the severe oversimplifications in its core models, which undermine the credibility of the entire framework and its results.
-
Fatally Simplified Performance Evaluation: The performance evaluation relies on the Astra-sim2.0 simulator with the explicit admission of "ignoring congestion and routing control" (Section 5.2.2). For a wafer-scale system where thousands of cores communicate across a massive mesh network, congestion is not a second-order effect; it is a primary performance bottleneck. Claiming a 10.66x performance improvement for Llama2-72B (Section 5.3) based on a congestion-free model is fundamentally unsound. The reported performance is likely an unobtainable theoretical peak rather than a realistic projection of system throughput. (A toy queueing illustration of this concern follows this weakness list.)
-
Unvalidated and Abstract Physical Models: The framework’s ability to prune the design space relies on a set of "theoretical prediction models" for interposer reliability, including warpage, thermal, SI, and PI (Section 4.2.3). These are presented as simple analytical equations with un-specified coefficients (K_cool, K_bp, K_SI, K_PI). There is no evidence that these models have been validated against industry-standard finite element analysis (e.g., Ansys) or electromagnetic simulation tools (e.g., Clarity, SIwave). Using such unverified heuristics to discard vast portions of the design space is perilous; the framework could be systematically discarding valid, high-performance designs or, conversely, retaining physically unrealizable ones. The mention of an in-house prototype (Figure 12) is insufficient validation for a model meant to span a parameter space as large as the one defined in Table 2.
-
Superficial Cost and Yield Modeling: The cost model (Section 4.4) hinges on a yield parameter, Y_die. Yield modeling for wafer-scale integration is a notoriously complex field that depends heavily on defect distribution, redundancy mechanisms, and harvesting strategies. The paper offers no details on its yield model beyond this single variable. It is unclear how this model accurately captures the cost trade-offs of manufacturing a massive interposer with numerous heterogeneous KGDs. Without a credible and detailed cost model, the central claim of optimizing performance "under the same cost constraint" is unsubstantiated. The comparison against the "modeled Dojo tray" is only as meaningful as the accuracy of the model, which is not demonstrated.
-
Architectural Ambiguity: The concept of "on-wafer memory" is central to the computing architecture, yet its implementation is described at a very high level of abstraction (Section 2.2.1). How are these memory dies architecturally integrated? Do they function as a distributed last-level cache, partitioned memory space, or something else? What are the latency and bandwidth characteristics of the die-to-die links connecting them to compute KGDs? The paper does not provide enough detail to assess the feasibility or performance implications of this core proposal. The gains attributed to this feature are therefore speculative.
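To illustrate why the congestion-free assumption matters, the comparison below uses generic queueing math, not the paper's network model; the per-link bandwidth and message size are assumptions of mine.

```python
# Mean per-hop link latency under an M/M/1 approximation versus a congestion-free
# model that charges only the serialization delay.

LINK_BW_GBps = 100.0          # assumed per-link bandwidth (bytes per ns * 1e3 = bytes per us)
MSG_BYTES = 64 * 1024         # assumed message size
service_time_us = MSG_BYTES / (LINK_BW_GBps * 1e3)   # microseconds to serialize one message

for utilization in (0.3, 0.6, 0.9, 0.99):
    congestion_free = service_time_us
    mm1 = service_time_us / (1.0 - utilization)       # M/M/1 mean sojourn time
    print(f"load {utilization:4.2f}: no-congestion {congestion_free:6.3f} us, "
          f"with queueing {mm1:7.3f} us ({mm1 / congestion_free:5.1f}x)")
```

Even this crude model shows latency inflating by an order of magnitude as links approach saturation, which is exactly the regime a congestion-free simulator cannot capture.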
Questions to Address In Rebuttal
-
Please provide a robust justification for using a congestion-free network model for performance evaluation. Can you present sensitivity studies or data from existing literature to demonstrate that congestion is a negligible factor for the LLM workloads and system scales you are analyzing? Otherwise, how can the performance claims in Figure 13 be considered credible?
-
The reliability models in Section 4.2.3 are critical for design space pruning. Please provide evidence of their validity. Specifically, show a comparison of the predictions from your analytical models (for warpage, IR drop, etc.) against results from established simulation tools for at least three diverse design points (e.g., small chip/low power, large chip/high power, high IO density).
-
Elaborate on the yield model (Y_die) used in your cost calculations. How does it account for defect tolerance and redundancy, which are essential for achieving viable yields in wafer-scale systems? How was this model calibrated, and what is its margin of error when comparing the cost of your proposed T-Arch against the D-Arch baseline?
-
Please provide a more detailed architectural description of the "on-wafer memory" subsystem. What is the coherence protocol, the memory access latency from a compute core, and how does the NoC prioritize traffic between compute-compute and compute-memory communication?
Review 2
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents Titan, an automated framework for the co-design of wafer-scale computing systems, aiming to "cram a data center into one cabinet." The central thesis is that current wafer-scale systems, while powerful, suffer from inefficiencies due to a lack of coordinated design between the logical computing architecture (cores, memory hierarchy) and the physical hardware architecture (packaging, power, cooling).
To address this, the authors propose a hierarchical, parameterized model of a full cabinet system, from individual compute cores up to the arrangement of wafer-scale chips. The Titan framework uses this model to perform a comprehensive design space exploration (DSE). The core contributions of this framework are:
- A holistic co-design methodology that simultaneously optimizes compute and physical hardware parameters.
- The introduction of a "vertical area constraint" as a key heuristic to prune the enormous design space by enforcing area alignment between stacked functional layers.
- The integration of early-stage physical reliability models (warpage, SI/PI) and cost models to eliminate unfeasible designs and enable multi-objective optimization.
Through simulation, the authors demonstrate that Titan-generated designs significantly outperform a baseline modeled on Tesla's Dojo architecture, and they use the framework to derive valuable, non-obvious insights about the relationship between single-chip area and overall cabinet cost-efficiency.
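A minimal sketch of the kind of pruning loop described above: the 0.72 margin is the default the paper reports, while the parameter ranges, area/power proxies, and reliability limits are invented placeholders rather than Titan's actual models.

```python
import itertools

D_AREA = 0.72   # vertical-area alignment margin reported as the paper's default

def vertical_area_ok(compute_area, layer_areas, margin=D_AREA):
    """Keep a design only if every stacked layer's area stays aligned with the compute
    layer's area to within the margin (a simplified reading of the constraint)."""
    return all(margin <= compute_area / a <= 1.0 / margin for a in layer_areas)

def reliability_ok(chip_area, power):
    """Placeholder early-stage filters standing in for the warpage/SI/PI models."""
    return chip_area <= 1500 and power / chip_area <= 1.2   # invented limits

def perf_model(cores, mem_dies):
    return cores * 1.0 + mem_dies * 0.3                     # invented analytic proxy

best = None
for cores, mem_dies in itertools.product((400, 600, 800), (0, 8, 16)):
    compute_area = cores * 1.2 + mem_dies * 10               # invented area model
    layers = [compute_area * 1.1, compute_area * 0.9]        # cooling, substrate (assumed)
    power = cores * 1.5 + mem_dies * 4
    if not vertical_area_ok(compute_area, layers):
        continue                                             # pruned before evaluation
    if not reliability_ok(compute_area, power):
        continue
    cand = (perf_model(cores, mem_dies), cores, mem_dies)
    best = max(best, cand) if best else cand
print("best surviving design (perf proxy, cores, mem dies):", best)
```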
Strengths
This is an ambitious and important piece of work that sits at the confluence of computer architecture, EDA, and advanced packaging. Its primary strengths are:
-
Holistic, Cross-Layer Scope: The paper's most significant strength is its scope. It successfully bridges the chasm between microarchitectural decisions (e.g., number of MAC units per core, on-wafer memory dies) and system-level physical design (e.g., interposer layers, C4 bump pitch, cabinet layout). This connects two worlds that are often optimized in isolation. This approach is reminiscent of the broader push towards System-Technology Co-Optimization (STCO), and this paper provides an excellent, concrete example of STCO applied to the design of next-generation AI hardware.
-
Grounded in Realistic Technology: The framework is not built in a vacuum. The authors ground their architectural models in real-world, state-of-the-art systems and technologies, referencing Tesla's Dojo (Figure 2), TSMC's InFO_SoW packaging, and a specific process node (12nm). The inclusion of their own in-house prototyped wafer-scale chip (Figure 12) is a powerful addition that lends significant credibility to their cost, yield, and area models, elevating this from a purely theoretical exercise.
-
A Practical Heuristic for a Vast Problem: The design space for a wafer-scale cabinet is combinatorially explosive. The "vertical area constraint" introduced in Section 2.2.2 and modeled in Section 4.2.2 is a simple yet elegant heuristic to manage this complexity. By forcing the area of supporting layers (cooling, substrate) to be proportional to the compute layer, Titan can effectively prune designs that would be spatially inefficient, drastically improving the efficiency of the DSE process.
-
Generates Actionable Architectural Insights: The true value of a DSE framework lies not just in finding a single optimal point, but in revealing the underlying design trade-offs. The analysis in Section 5.4 ("Design Considerations for Single-chip Area") and the accompanying Figure 14 are excellent examples of this. The finding that the largest possible chip is not always the most cost-effective solution for a cabinet—due to discrete chip array configurations and cost constraints—is a crucial, non-obvious insight for system architects.
Weaknesses
While the work is strong, its positioning and some underlying assumptions could be strengthened. These are not fatal flaws but opportunities for improvement.
-
Understated Connection to Broader Fields: The authors could better contextualize their work within the established literature of DSE and STCO. While they have built an impressive system, the paper would have greater impact if it explicitly framed Titan as an advanced STCO framework for wafer-scale systems, drawing clearer parallels to and distinctions from prior work in the EDA and packaging communities.
-
Simplifications in Performance Modeling: The performance evaluation relies on Astra-sim2.0 with "simplified assumptions (ignoring congestion and routing control)" as noted in Section 5.2.2. In massively parallel systems like these, network congestion can be a first-order performance bottleneck. Ignoring it may lead the optimizer to favor designs with high theoretical bandwidth that would not be realizable in practice. This simplification weakens the claims about simulated performance improvements.
-
Abstraction of the Cost Model: The cost model in Section 4.4 is comprehensive for an academic work, including wafer cost, yield, bonding, and cooling. However, the true cost of such systems is also heavily influenced by factors like Non-Recurring Engineering (NRE) costs, the complexity of testing a wafer-scale part, and supply chain logistics, which are not captured. While perfectly modeling this is beyond scope, acknowledging these other major cost drivers would provide a more complete picture.
Questions to Address In Rebuttal
-
The "vertical area constraint" margin,
darea, is a critical parameter for the exploration process. The paper states a default value of 0.72 was chosen based on experiments (Section 4.5). Could the authors comment on the sensitivity of the final design quality to this parameter? How might the optimaldareachange for different technology nodes (e.g., 3nm vs 12nm) or for different optimization targets (e.g., pure performance vs. performance/watt)? -
Regarding the performance simulations, can the authors elaborate on the potential impact of network congestion? For example, would including congestion effects likely favor architectures with more on-chip memory (to reduce off-chip traffic) or different on-chip network topologies, even if it meant a lower peak theoretical FLOPS?
-
The Titan framework appears to be highly adept at exploring the design space for Dojo-like tiled architectures. How generalizable is the framework? What key modifications would be necessary to model and optimize a fundamentally different wafer-scale architecture, such as a more monolithic Cerebras-style design with a unified memory fabric, or a heterogeneous system integrating different types of compute dies on the same wafer?
Review 3
Paper Title: Cramming a Data Center into One Cabinet, a Co-Exploration of Computing and Hardware Architecture of Waferscale Chip
Reviewer Persona: The Innovator (Novelty Specialist)
Summary
The authors present "Titan," an automated framework for the co-design and exploration of wafer-scale, single-cabinet data center architectures. The framework is built upon a hierarchical, parameterized model that spans from the intra-chip level (compute cores, on-wafer memory) to the inter-chip cabinet level (chip arrays, host interfaces). The central claim of novelty lies in the framework's methodology, which employs two key mechanisms to navigate the vast design space: 1) a "vertical area constraint" to enforce co-optimization across the physically stacked layers of a wafer-scale system (cooling, compute, interposer, substrate), and 2) the integration of early-stage, predictive models for physical reliability (warpage, SI, PI) to prune unfeasible designs before full evaluation. The goal is to automate the discovery of cost-performance optimal wafer-scale systems.
Strengths
The primary strength of this work lies in its novel synthesis and systematization of design principles for the emerging domain of wafer-scale systems. While individual components of the framework draw from existing concepts, their integration to address the unique, multi-physics challenges of a monolithic wafer-scale cabinet is new.
-
The "Vertical Area Constraint" Heuristic: The most salient novel concept is the "vertical area constraint" proposed in Section 2.2.2 and implemented in Section 4.2.2. While co-design frameworks exist, this is a simple, physically-grounded, and—to my knowledge—newly articulated heuristic specifically for vertically stacked, wafer-scale packages. It elegantly captures the cross-layer dependency on area, directly targeting the integration density problem and serving as a powerful pruning mechanism in the design space exploration (DSE).
-
Methodological Shift in Reliability Analysis: The paper proposes moving reliability analysis from a late-stage validation step to an early-stage architectural filter. The use of predictive models for interposer warpage, SI, and PI during DSE (Section 4.2.3) is a significant methodological advance over prior art, where such considerations are typically too computationally expensive for architectural exploration. This integration is crucial for making the exploration of such a large and complex design space tractable.
-
Formalization of the Wafer-Scale Co-Design Problem: Existing works on wafer-scale systems like Cerebras' WSE and Tesla's Dojo are point-solution case studies. This paper is the first I have seen to propose a generalized, automated framework to explore the entire design space of such systems. The novelty here is the shift from demonstrating a single instance to creating a methodology for discovering a whole class of optimal instances.
Weaknesses
While the synthesis is novel, the work builds heavily on pre-existing concepts, and the novelty of some individual components must be carefully delineated.
-
Incremental Novelty of Component Models: The paper is fundamentally a work of engineering synthesis. The core ideas of parameterized performance/cost modeling, design space exploration, and even reliability modeling are not new in themselves. For example, cost modeling for multi-die systems has been explored in works like "Chiplet actuary" (Feng and Ma, DAC '22), which the authors cite [10]. The reliability models for warpage, SI, and PI are explicitly based on established theoretical formulas from prior works, which are also cited ([49], [52], [18]). The novelty is in their adaptation and integration, not their fundamental formulation.
-
Conceptual Overlap with 3D-IC/Chiplet Co-Design: The concept of adding configurable memory dies alongside compute dies (Section 2.2.1) is a direct extension of existing 2.5D/3D packaging trends (e.g., TSMC's CoWoS-S with HBM). The novelty is not the idea of on-package memory itself, but rather the framework's ability to quantitatively evaluate the trade-offs of including it in a wafer-scale context. This distinction is subtle and could be made clearer. The overall co-design approach bears conceptual resemblance to DSE frameworks for heterogeneous chiplet-based systems, with the key differentiator being the specific constraints and scale of a monolithic wafer.
Questions to Address In Rebuttal
-
Pinpointing the Core Novel Mechanism: The concept of system-level co-design is well-established. Please articulate precisely which specific mechanism or model within Titan represents the most significant conceptual leap from prior co-design frameworks for complex, multi-chiplet systems. Is it primarily the "vertical area constraint," or is there another fundamental innovation?
-
Quantifying the Novelty of Reliability Models: Regarding the predictive reliability models detailed in Section 4.2.3, please clarify the delta between your work and the direct application of existing theoretical models from the cited literature (e.g., [49], [52]). What specific adaptations or novel assumptions were required to make these models sufficiently fast and accurate for early-stage architectural DSE, as opposed to late-stage signoff?
-
Robustness of the Vertical Area Constraint: The "vertical area constraint" is an elegant heuristic. However, could you discuss its potential limitations? Are there plausible scenarios where this hard constraint might prematurely prune a non-obvious, globally optimal design? For instance, a design with a significantly larger substrate to accommodate superior power delivery might be discarded despite enabling a disproportionately higher compute performance.
-
Generalizability of the Framework: While the framework is presented for wafer-scale systems, its core principles seem applicable to other complex, 3D-stacked heterogeneous packages. Could the Titan methodology, particularly the vertical constraint and early reliability checks, be considered a novel, general-purpose approach for co-design in the advanced packaging era, beyond just wafer-scale integration?
Dynamic Load Balancer in Intel Xeon Scalable Processor: Performance Analyses, Enhancements, and Guidelines
Abstract
The rapid increase in inter-host networking speed has challenged host processing capabilities, as bursty traffic and uneven load distribution among host CPU cores give rise to excessive queuing delays and service latency variances. To cost-efficiently ...
Reviews
Review 1
Paper Title: Dynamic Load Balancer in Intel Xeon Scalable Processor: Performance Analyses, Enhancements, and Guidelines
Reviewer: The Guardian
Summary
This paper presents an analysis of the Dynamic Load Balancer (DLB), an on-chip accelerator in recent Intel Xeon processors. The authors first conduct a microbenchmark comparison showing DLB's superior throughput over software-based load balancers, but identify that feeding the DLB still consumes significant host CPU cycles. To address this, they propose AccDirect, a system that uses PCIe Peer-to-Peer (P2P) communication to allow a SmartNIC to directly enqueue work descriptors into the DLB, bypassing the host CPU. The evaluation shows that AccDirect maintains performance comparable to host-driven DLB while reducing system power, and outperforms a static hardware load balancer (RSS) in an end-to-end key-value store application by 14-50% in throughput. Finally, the paper provides a set of performance guidelines for configuring DLB.
Strengths
- The paper provides the first in-depth, public characterization of a novel and relevant piece of commercial hardware (Intel DLB). This exploration of a new feature is valuable to the community.
- The problem identification is clear and well-motivated. The analysis in Section 3.2 (specifically Figure 4) correctly identifies the host-side enqueue operation as the next bottleneck after offloading the core load-balancing logic, which provides a strong foundation for the proposed solution.
- The core architectural idea of AccDirect—chaining a peripheral (NIC) to an on-chip accelerator (DLB) via PCIe P2P—is a compelling concept for building more disaggregated and efficient systems.
Weaknesses
My primary concerns with this paper relate to the rigor of the experimental evaluation and the clarity of the claimed contributions. While the ideas are interesting, the evidence provided is not sufficient to fully substantiate the claims.
-
Unconvincing Experimental Baseline: The main end-to-end application performance claim (Section 4.5, Figure 10) relies on a comparison between the proposed dynamic load balancer (AccDirect-DLB) and a static hash-based load balancer (RSS). The chosen workloads (Masstree with mixed GET/SCAN operations) are explicitly designed to have variable service times, a scenario that is known to be the worst-case for static balancers like RSS and the best-case for dynamic balancers. This comparison feels engineered to highlight the benefits of DLB rather than rigorously comparing it against a credible alternative. A state-of-the-art dynamic software load balancer (e.g., inspired by Shenango or Caladan, which are cited but not compared against) would have been a far more appropriate and challenging baseline. As it stands, the 14-50% improvement over RSS is not surprising and its significance is questionable.
-
Omission of Critical Performance Metrics: The central premise of AccDirect is to save host CPU cycles by offloading the enqueue task to a SmartNIC. However, the paper completely omits any quantification of the resource consumption on the SmartNIC. The "SNIC agent" (Section 4.3) runs on the SmartNIC's Arm cores. How many cores are required? What is their CPU utilization under the loads tested in Figures 9 and 10? Without this data, the work has not been "eliminated" but merely "moved." This is a critical omission that undermines the claim of improved efficiency. It is entirely possible that the Arm cores on the SmartNIC become the new bottleneck or that the power saved on the host x86 cores is offset by the power consumed by the Arm cores.
-
Insufficient Substantiation of Power Savings: The abstract and Section 4.5 claim a system-wide power reduction of up to 10%. This claim is based on the data in Figure 9, which shows an absolute power saving of ~30W at the highest request rate. However, the paper never states the baseline total system power from which this 10% figure is calculated. Reporting a relative improvement without providing the denominator is not rigorous. The baseline power measurement for the entire server under the corresponding load must be explicitly stated for this claim to be verifiable.
-
Dilution of Research Contribution: A substantial portion of the paper (Section 6, Figures 11, 12, 13) is dedicated to a detailed parameter-tuning study of DLB. While this information is useful for engineers looking to use this specific Intel product, it reads more like an application note or a user guide than a novel research contribution suitable for a premier architecture conference. This extensive characterization dilutes the paper's focus and makes it feel less like a cohesive research paper and more like a combination of a system proposal (AccDirect) and product documentation.
Questions to Address In Rebuttal
The authors must address the following points to convince me of the paper's validity and contribution.
- Please justify the choice of RSS as the primary hardware baseline in your end-to-end evaluation (Figure 10). Given the known limitations of static balancing for skewed workloads, how can you claim a significant advantage without comparing against a state-of-the-art dynamic software load balancer?
- Please provide the resource utilization data for the SmartNIC agent used in the AccDirect experiments. Specifically, state the number of Arm cores utilized on the BlueField-3 DPU and their average CPU utilization percentage for the results shown in Figure 9 and Figure 10.
- Regarding the 10% system-wide power saving claim, please state the absolute total system power consumption (in Watts) of the server for the baseline dlb-lib configuration at the 14.5 MRPS data point in Figure 9.
- The evaluation of AccDirect appears to be conducted under conditions highly favorable to dynamic load balancing (i.e., significant service time variance). How do the performance benefits of AccDirect over the baseline (RSS) change as the workload becomes more uniform (i.e., as the variance in service times approaches zero)? A sensitivity analysis is needed to understand the boundaries of your solution's benefits.
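The sensitivity question above can be illustrated with a toy simulation (synthetic open-loop workload, eight workers, 80% load; none of this is the paper's Masstree setup): RSS-like static hash dispatch versus an idealized dynamic dispatcher, swept across service-time distributions with the same mean.

```python
import random

def simulate(dynamic, service_sampler, n_req=20_000, n_workers=8, load=0.8, seed=0):
    """Open-loop arrivals; report the p99 sojourn time (queueing + service)."""
    rng = random.Random(seed)
    gap = 1.0 / (n_workers * load)           # mean inter-arrival time for unit mean service
    t, free_at, lat = 0.0, [0.0] * n_workers, []
    for _ in range(n_req):
        t += rng.expovariate(1.0 / gap)
        if dynamic:
            w = min(range(n_workers), key=free_at.__getitem__)  # earliest-free worker
        else:
            w = rng.randrange(n_workers)      # static: hash of the flow key ~ random core
        start = max(t, free_at[w])
        free_at[w] = start + service_sampler(rng)
        lat.append(free_at[w] - t)
    lat.sort()
    return lat[int(0.99 * len(lat))]

workloads = {
    "uniform service":  lambda r: 1.0,
    "exponential":      lambda r: r.expovariate(1.0),
    "bimodal GET/SCAN": lambda r: 0.2 if r.random() < 0.9 else 8.2,   # same mean of 1.0
}
for name, sampler in workloads.items():
    print(f"{name:17s}  p99 static {simulate(False, sampler):7.2f}   "
          f"p99 dynamic {simulate(True, sampler):7.2f}")
```

As service-time variance grows, the gap between the two policies widens sharply; with uniform service it largely collapses, which is the boundary condition the question asks the authors to quantify.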
Review 2
Reviewer Persona: The Synthesizer (Contextual Analyst)
Summary
This paper presents a comprehensive study of the new Intel Dynamic Load Balancer (DLB), an on-chip accelerator designed to offload packet scheduling and queue management from host CPU cores. The work is structured into three main contributions. First, it provides a thorough performance characterization of DLB, demonstrating its significant throughput and scalability advantages (up to 100 MPPS) over traditional software-based load balancers, which saturate around 40 MPPS. Second, and most importantly, it identifies a key limitation of the conventional DLB usage model: it still consumes significant host CPU cycles to simply prepare and enqueue work descriptors. To solve this, the authors propose "AccDirect," a novel system architecture that leverages PCIe Peer-to-Peer (P2P) communication to create a direct control path between a network interface card (NIC) and the on-chip DLB. This "accelerator chaining" approach effectively bypasses the host CPU, reducing system-wide power consumption by up to 10% and improving end-to-end application throughput by 14-50% compared to baselines. Third, the paper offers a valuable set of practical guidelines for configuring and optimizing DLB, drawn from an extensive microbenchmark analysis of its advanced features.
Strengths
-
Timeliness and High Relevance: The paper provides the first in-depth, public analysis of a new and important piece of commodity hardware. As datacenters grapple with the "datacenter tax" of managing high-speed networks, understanding and optimizing on-chip accelerators like DLB is of paramount importance to both the systems research community and industry practitioners.
-
Novel and Impactful Systems Contribution: The core idea of AccDirect is a significant contribution. While its constituent technologies (PCIe P2P, SmartNICs, on-chip accelerators) are not new in isolation, the authors' work in integrating them into a cohesive, host-bypassing architecture is a novel and powerful systems concept. This work serves as an excellent blueprint for a future where data flows are orchestrated between on-chip and off-chip accelerators with minimal host CPU intervention. The successful demonstration of chaining an I/O device directly to an integrated accelerator is a key step towards more efficient, accelerator-centric server architectures.
-
Thorough and Methodical Evaluation: The evaluation is comprehensive and compelling. It starts with foundational microbenchmarks to motivate the problem (Section 3, Figure 4, page 5), proceeds to a direct evaluation of the proposed solution's power and performance benefits (Section 4.5, Figure 9, page 8), and culminates in a real-world, end-to-end application benchmark (Masstree KVS, Figure 10, page 9) that demonstrates tangible benefits over both hardware and software baselines. This multi-layered approach provides strong evidence for the authors' claims.
-
Exceptional Practical Value: The detailed characterization study and the resulting guidelines presented in Section 6 are a major strength. The authors demystify a complex hardware component with a vast configuration space, providing clear implications and trade-offs for parameters like port types, wait modes, and priority levels. This section, on its own, is a valuable resource that will enable other researchers and engineers to make effective use of this new hardware.
Weaknesses
My criticisms are less about flaws and more about opportunities to further elevate the work's positioning and impact.
-
Reliance on a Programmable SmartNIC: The current implementation of AccDirect depends on a sophisticated SmartNIC (NVIDIA BlueField-3) to act as the agent that prepares and enqueues work to the DLB. While this is an excellent choice for a research prototype, it limits the immediate applicability of the approach, as such devices are not yet ubiquitously deployed. The paper acknowledges this in the Discussion (Section 5, page 9) but could benefit from a more detailed exploration of what would be required to enable this functionality with less-programmable, commodity NICs.
-
Understated Framing of the Core Concept: The authors frame the work primarily as an analysis and enhancement of DLB. However, the more profound contribution is the demonstration of a general architectural pattern: host-transparent, P2P-based accelerator chaining. This pattern has implications far beyond just DLB and networking. The paper could be strengthened by explicitly framing AccDirect as a case study of this broader architectural principle, connecting it more strongly to the vision of accelerator-centric or disaggregated systems explored in works like Lynx [53].
-
Limited Scope of Application Evaluation: The end-to-end evaluation focuses on an RDMA-based Key-Value Store. While this is a critical datacenter workload, the AccDirect pattern could be highly beneficial for other domains, such as Network Function Virtualization (NFV) service chains, storage disaggregation (NVMe-oF), or data-intensive computing pipelines. A brief discussion on how the AccDirect principles might apply to these other areas would broaden the perceived impact of the work.
Questions to Address In Rebuttal
-
Could the authors elaborate on the path to implementing a system like AccDirect without a fully programmable SmartNIC? What specific, minimal hardware capabilities would a more conventional NIC need (e.g., a flexible DMA engine, limited packet parsing logic) to act as an enqueuing agent for DLB?
-
The concept of P2P accelerator chaining is very powerful. Beyond the NIC->DLB chain demonstrated here, have the authors considered how this pattern could be extended? For example, could a worker core, after being scheduled by DLB, use a similar mechanism to directly chain a task to another on-chip accelerator like Intel's QuickAssist Technology (QAT) or Data Streaming Accelerator (DSA) without returning to a central scheduler?
-
The use of RDMA atomics to manage the DLB credit pool from the SmartNIC is a clever solution to a tricky problem (Section 4.3, page 7). Could you comment on any performance implications of executing atomic operations over the PCIe bus versus on the host CPU? Specifically, how does the PCIe Root Complex handle the contention between PCIe atomics from the NIC and locked instructions from host cores targeting the same memory location?
Review 3
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents a comprehensive performance analysis of the Intel Dynamic Load Balancer (DLB), a recently introduced on-chip accelerator. The authors first characterize its performance against software-based alternatives, identifying a significant limitation: the high host CPU cost required to enqueue work descriptors to the DLB at high packet rates. The primary claimed contribution is a system architecture named AccDirect, which establishes a direct control path between a NIC (specifically, a SmartNIC) and the on-chip DLB. This is achieved using standard PCIe Peer-to-Peer (P2P) communication, allowing the SmartNIC to directly enqueue work descriptors into the DLB's hardware queues, thereby completely offloading this "datacenter tax" from the host CPU. The authors demonstrate that this approach saves up to 10% of system-wide power and improves the throughput of an end-to-end application by 14-50% compared to baselines.
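To fix ideas about the control path described above, here is an abstract model of the two pieces the review highlights: credit management and descriptor enqueue. The class structure, descriptor fields, and batch size are invented for illustration and do not reflect Intel's DLB interface or the authors' implementation.

```python
import threading

class CreditPool:
    """Host-memory credit counter; the NIC-side agent would update it with RDMA atomics."""
    def __init__(self, total):
        self._credits, self._lock = total, threading.Lock()
    def acquire(self, batch):
        with self._lock:                      # stands in for one PCIe atomic round trip
            take = min(batch, self._credits)
            self._credits -= take
            return take
    def release(self, n):
        with self._lock:
            self._credits += n

class ProducerPort:
    """Stand-in for a DLB producer port; real enqueues are fixed-size MMIO stores over P2P."""
    def __init__(self):
        self.queue = []
    def enqueue(self, descriptor):
        self.queue.append(descriptor)

pool, port = CreditPool(total=128), ProducerPort()
local_credits = 0
for pkt_id in range(10):
    if local_credits == 0:
        local_credits = pool.acquire(batch=32)                  # batching amortizes atomics
    if local_credits:
        port.enqueue({"pkt": pkt_id, "queue_id": pkt_id % 4})   # invented descriptor fields
        local_credits -= 1
print("enqueued:", len(port.queue), "remaining local credits:", local_credits)
```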
Strengths
-
Clever Application of Existing Primitives: The core idea of using PCIe P2P to enable direct device-to-device communication is not new, but its application here is specific and clever. The authors have engineered a solution where an off-chip, peripheral device (a SmartNIC) directly issues control commands (work descriptor enqueues) to an on-chip, root-complex-integrated accelerator (the DLB). This moves beyond the common P2P use case of bulk data transfer (e.g., GPUDirect) and into the domain of direct, fine-grained accelerator control, which is a valuable engineering contribution.
-
Identifies and Solves a Concrete Problem: The paper does an excellent job of identifying a real-world performance bottleneck. The finding in Section 3.2 (Figure 4, page 5) that it requires five host CPU cores just to feed the DLB enough work to reach its 100 MPPS potential is a stark and compelling motivation. AccDirect provides a direct and effective solution to this specific problem.
-
Strong Empirical Results: The demonstrated benefits are significant. A 10% reduction in total system power (Figure 9, page 8) by offloading the enqueue task is a substantial gain in a datacenter context. The end-to-end application improvement further validates that the architectural change translates to real-world performance benefits.
Weaknesses
The primary weakness of this work lies in the degree of its conceptual novelty when viewed against the landscape of prior art in accelerator and SmartNIC offloading.
-
Conceptual Overlap with Prior Work on Accelerator-Centric Architectures: The high-level concept of a SmartNIC acting as the central orchestrator for data and control flow, bypassing the host CPU, has been previously proposed.
- Lynx [53] proposed an "accelerator-centric" architecture where a SmartNIC offloads both data and control planes, using PCIe P2P to distribute messages directly to other accelerators' memory. While AccDirect's mechanism of writing to a control register (DLB's producer port) is more direct than writing to a memory queue, the fundamental concept of SmartNIC-led P2P dispatch is the same. The authors cite Lynx in their related work (Section 7, page 12) but do not sufficiently distinguish their core idea from it. AccDirect appears to be a highly effective, but specific, instantiation of the Lynx philosophy.
-
Prior Art in Direct P2P Device Control: The idea of one PCIe device directly controlling another via P2P MMIO writes is also not fundamentally new.
- FlexDriver [11] demonstrated an architecture where an accelerator could host a "NIC driver" to directly control a NIC over PCIe P2P. AccDirect implements the inverse: a NIC (or its embedded CPU) controlling another accelerator (DLB). While the targets are different (on-chip vs. off-chip), the core mechanism of P2P-based device control is conceptually identical. The novelty delta here seems to be in the engineering specifics of targeting the DLB, not in the architectural pattern itself.
-
The "Framework" is a Specific Point-Solution: The authors present AccDirect as a general framework for "accelerator chaining." However, the implementation is tightly coupled to the specifics of the Intel DLB (its BAR structure, queueing mechanism) and a BlueField-3 SmartNIC (using its onboard Arm cores and RDMA capabilities). The choice to use one-sided RDMA verbs as the control primitive (Section 4.2, page 6) is presented as a key enabler, but this appears to be more of a software engineering choice for convenience and generality rather than a fundamental new mechanism. The same control could be achieved with more primitive, direct P2P MMIO writes from the SmartNIC, which is a known capability. This makes the contribution feel less like a new, generalizable framework and more like an exemplary, vertically-integrated system design.
Questions to Address In Rebuttal
-
Please clarify the precise conceptual novelty of AccDirect over Lynx [53] and FlexDriver [11]. Beyond the difference in the specific accelerators being chained, what is the fundamental architectural principle or mechanism in AccDirect that was not already proposed or demonstrated in these prior works? Is the primary contribution the engineering feat of successfully targeting an on-chip, root-complex-integrated device, and if so, what were the non-obvious technical hurdles that make this a distinct scientific contribution?
-
The paper frames AccDirect as a general "framework." To substantiate this claim, could the authors elaborate on how this framework would apply to chaining a different set of accelerators? For example, how would the principles of AccDirect be used to have a SmartNIC directly dispatch compute kernels to an on-chip GPU or schedule tasks on a domain-specific accelerator that does not use a producer-consumer queue model like the DLB?
-
The use of RDMA verbs for P2P control is a major part of the implementation. Is this choice fundamental to the novelty of the idea? Or is it an abstraction layer built on top of the already-existing PCIe P2P MMIO write capability? In other words, does the novelty lie in the use of P2P to control the DLB, or the use of RDMA verbs as the specific API to enact that control?
A4: Microarchitecture-Aware LLC Management for Datacenter Servers with Emerging I/O Devices
Abstract
In modern server CPUs, the Last-Level Cache (LLC) serves not only as a victim cache for higher-level private caches but also as a buffer for low-latency DMA transfers between CPU cores and I/O devices through Direct Cache Access (DCA). However, prior work ...
Reviews
Review 1
Reviewer: The Guardian
Summary
The authors present a study of Last-Level Cache (LLC) contention on modern Intel server CPUs, specifically focusing on interactions with high-bandwidth I/O devices. The paper purports to uncover two novel sources of contention: (C1) a "directory contention" where I/O cache lines from DCA-designated ways migrate to hidden "inclusive ways" upon CPU access, conflicting with non-I/O workloads; and (C2) contention within DCA ways between high-throughput storage I/O and latency-sensitive network I/O. Based on these observations, the authors propose A4, a runtime framework that uses Cache Allocation Technology (CAT), performance counters, and a supposedly "hidden" per-device DCA control knob to orchestrate LLC allocation. The framework aims to improve the performance of high-priority workloads (HPWs) without significantly harming low-priority ones (LPWs).
Strengths
- The paper attempts to tackle a relevant and increasingly challenging problem: managing shared resources (specifically, the LLC) in the face of high-bandwidth I/O, which is critical for modern datacenter performance.
- The authors provide a step-by-step evaluation (A4-a through A4-d) in Figure 13, which is useful for attributing performance gains to specific components of their proposed solution. This incremental analysis is a methodologically sound way to demonstrate the contribution of each technique.
Weaknesses
My assessment finds significant issues with the paper's foundational claims, methodological rigor, and the robustness of the proposed solution. These must be addressed before the work can be considered for publication.
-
The Central "Directory Contention" Claim (C1) is Built on Speculation, Not Proof: The entire premise of the "directory contention" rests on an observation of increased LLC miss rates for X-Mem when it is allocated to
way[9:10](Figure 3b). The authors then link this phenomenon to a microarchitectural mechanism (migration of I/O cache lines to inclusive ways) described in a prior reverse-engineering study [65]. This is a classic case of correlation presented as causation. The evidence provided is circumstantial at best. The authors fail to provide any direct evidence of this cache line migration. The validation experiment in Figure 4 (disabling DCA removes the contention) only shows that the contention is DCA-related; it does not prove the specific migration mechanism hypothesized by the authors. Alternative explanations, such as complex interactions with the coherence protocol, interconnect contention, or prefetcher behavior, have not been ruled out. Building a significant portion of the paper on an unproven microarchitectural hypothesis is a critical flaw. -
Unrepresentative Workloads and Lack of Generalizability: The characterization of storage I/O as benefiting "little from DCA" (Section 1, page 1) is a sweeping generalization based on a narrow, and frankly, convenient workload. The experiment in Section 3.2 uses FIO with O_DIRECT and random reads with large block sizes, a workload profile that is explicitly designed to bypass caches and maximize throughput, thus exhibiting poor temporal locality. This is a best-case scenario for the authors' argument but a poor representation of all storage I/O. What about metadata-heavy operations, database index traversals, or other workloads with higher temporal locality that would benefit from being cached via DCA? The paper conveniently ignores these scenarios, which undermines the credibility of claim C2 and the subsequent design choices in A4.
-
The Proposed Solution (A4) is Fragile and Over-Tuned: The A4 framework's robustness is highly questionable.
- "Hidden Knob": The solution relies on a "hidden feature" (Section 4.2, page 6) to selectively disable DCA for specific PCIe ports. The paper cites register
perfctrlsts_0[26]. Is this a documented, architecturally guaranteed interface for this purpose, or an undocumented MSR that may change or disappear in future CPU steppings or generations? Relying on such features makes the solution brittle and unsuitable for production environments. This is a major practical limitation that the authors do not sufficiently acknowledge. - Magic Thresholds: The framework's logic is governed by a multitude of hard-coded thresholds (Table 1, page 9). For example,
ANT_CACHE_MISS_THRis set to 90%, andDMALK_DCA_MS_THRto 40%. The sensitivity analysis in Figure 15 is superficial and only explores a small portion of a large, complex parameter space for a single workload mix. It demonstrates that the system is sensitive to these values but provides little confidence that the chosen values are optimal or will generalize to other workloads, systems, or SLOs. The framework appears to be over-fitted to the specific workloads evaluated.
- "Hidden Knob": The solution relies on a "hidden feature" (Section 4.2, page 6) to selectively disable DCA for specific PCIe ports. The paper cites register
-
Inadequate Comparison with the State-of-the-Art: The evaluation compares A4 against two weak baselines: "Default" (no CAT) and "Isolate" (naive static CAT partitioning). These are well-known strawmen. The paper fails to compare its results against more sophisticated, relevant prior work. For instance, the authors cite Yuan et al. [67] ("Don't forget the I/O when allocating your LLC"), which also proposes a dynamic CAT-based partitioning scheme to mitigate I/O-driven contention. A direct comparison to this or other utility-based dynamic partitioning schemes is essential to properly contextualize the contributions of A4. Without it, the claimed 51% improvement for HPWs is largely meaningless, as it is measured against baselines that are known to perform poorly under contention.
Questions to Address In Rebuttal
The authors must provide clear and convincing answers to the following questions:
- Can you provide direct microarchitectural evidence (e.g., using specialized performance counters, if available) that I/O cache lines from DCA ways are indeed migrating specifically to the "inclusive ways" (way[9:10]) upon CPU access? If not, how can you definitively rule out other potential causes for the observed contention in Figure 3b? (A generic sketch of the way-mask configuration involved appears after this question list.)
- Please justify the generalizability of your storage I/O characterization. Provide data for storage workloads with high temporal locality (e.g., database workloads) and demonstrate that they also do not benefit from DCA, or explain how A4 would handle cases where they do.
- Clarify the exact nature of the per-device DCA control mechanism. Is this a feature officially documented and supported by the CPU vendor for this purpose? What is the guarantee of its stability across different processor models and microcode updates?
- Why was A4 not compared against the system proposed in your cited work [67], which appears to address a very similar problem? Please provide such a comparison or a compelling reason for its omission.
- The A4 framework relies on at least five key thresholds. How would a system administrator realistically set these parameters in a production environment with a dynamic mix of unseen workloads? Please provide a more robust justification for the chosen values beyond the limited sensitivity study presented.
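For reference, the way-mask experiments discussed above (e.g., confining X-Mem to way[9:10]) correspond to ordinary CAT configuration through the Linux resctrl filesystem. A generic sketch follows, assuming an 11-way Skylake-style LLC; the group name and PID are placeholders, and this is not the authors' A4 implementation:

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Pin one resource group to LLC ways 9-10 (capacity bitmask 0x600) via resctrl. */
static int write_str(const char *path, const char *s)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    int rc = (fputs(s, f) < 0) ? -1 : 0;
    fclose(f);
    return rc;
}

int main(void)
{
    /* Assumes resctrl is mounted at /sys/fs/resctrl and CAT is supported. */
    mkdir("/sys/fs/resctrl/inclusive_ways", 0755);

    /* 0x600 sets bits 9 and 10 of an 11-way cache, i.e., way[9:10]. */
    if (write_str("/sys/fs/resctrl/inclusive_ways/schemata", "L3:0=600\n"))
        return 1;

    /* Attach a hypothetical PID to the group. */
    return write_str("/sys/fs/resctrl/inclusive_ways/tasks", "12345\n") ? 1 : 0;
}
```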
Review 2
Review Form: The Synthesizer (Contextual Analyst)
Summary
This paper identifies and addresses two novel, microarchitecturally-rooted sources of contention in the Last-Level Cache (LLC) of modern datacenter servers. The work focuses on systems with non-inclusive LLCs and high-bandwidth I/O devices, a common configuration in contemporary hardware. The core contributions are twofold: first, the diagnosis of (C1) "directory contention," where I/O data touched by a CPU core unexpectedly migrates to special "inclusive ways" of the LLC directory, contending with non-I/O application data; and (C2) contention within Direct Cache Access (DCA) ways, where high-throughput storage I/O, which benefits little from DCA, pollutes the cache for latency-sensitive network I/O.
To address these issues, the authors propose A4, a runtime LLC management framework. A4 orchestrates LLC resources using existing hardware features like Intel's Cache Allocation Technology (CAT) and performance counters. It intelligently partitions the LLC based on workload priority, safeguards critical I/O buffers in DCA and inclusive ways, and notably, uses a device-specific control knob to selectively disable DCA for antagonistic storage workloads. The evaluation demonstrates that A4 significantly improves the performance of high-priority workloads (by 51%) without harming low-priority ones, effectively untangling these subtle but impactful performance bottlenecks.
Strengths
-
Novel and Insightful Problem Diagnosis: The primary strength of this paper lies in its deep and insightful diagnosis of previously unrecognized performance issues. The identification of "directory contention" (Section 3.1, page 4) is a particularly significant contribution. It connects the dots between the reverse-engineered understanding of non-inclusive cache directories (e.g., [65]) and the practical performance implications of I/O data flow. This moves the community's understanding beyond known issues like "DMA bloat" [2] into a more nuanced appreciation of modern cache-I/O interactions.
-
Contextual Relevance and Timeliness: The work is perfectly timed. Datacenters are increasingly deploying servers with non-inclusive LLCs (recent Intel Xeon and AMD Zen CPUs) and ever-faster I/O devices (100+ Gbps NICs and NVMe SSDs). The paper’s finding that storage I/O bandwidth is now on par with network I/O, yet has vastly different cache needs (Section 3.2, page 5), highlights a critical inflection point where old assumptions about DCA being universally beneficial are no longer valid. This paper provides a clear articulation of this emerging problem.
-
Pragmatic and Deployable Solution: A4 is not a theoretical proposal requiring new hardware. It is a software framework built entirely on existing, albeit sometimes obscure, hardware capabilities. By leveraging Intel CAT, performance counters, and a little-known register to control DCA on a per-device basis, the solution is grounded in reality. This pragmatism makes the work immediately relevant to practitioners and operators of large-scale systems looking to improve performance predictability and server utilization.
-
Excellent Explanatory Evaluation: The experimental methodology is a model of clarity. In particular, the incremental evaluation in Figure 13 (page 11), which shows the performance impact of applying each of A4’s strategies (A4-a through A4-d) one by one, is extremely effective. It allows the reader to directly map each proposed solution component to its real-world performance benefit, building a strong and convincing case for the final, complete system.
Weaknesses
While the core ideas are strong, the work could be better contextualized and its boundaries more clearly defined.
-
Limited Architectural Scope: The investigation is thoroughly conducted on an Intel Skylake server architecture. However, the non-inclusive LLC design principle has also been adopted by competitors like AMD (Zen architecture) and is present in newer Intel CPUs. The specific mechanism of "inclusive ways" tied to the directory might be an Intel-specific implementation. The paper would be significantly strengthened by a discussion on the generalizability of its findings. For instance, do other non-inclusive directory implementations present similar or different aliasing hazards? Acknowledging this limitation and speculating on the implications for other architectures would broaden the work's impact.
-
Complexity of the Heuristic-Based Framework: The A4 framework relies on a set of five thresholds (T1-T5) and two timing parameters to make its runtime decisions (Section 5.7, page 9). While the authors provide a sensitivity analysis (Figure 15, page 12), this points to a system that may be complex to tune and potentially fragile in production environments with highly dynamic workload mixes. The paper could benefit from a discussion on the vision for deploying such a system—whether these parameters are "set-and-forget" or would require sophisticated, automated online tuning.
-
Understated Relationship to Broader OS/Runtime Scheduling: A4 is essentially a resource scheduler for the LLC. It operates based on a static priority ("HPW" vs "LPW"). In many modern datacenters, workload priority is dynamic and managed by cluster schedulers (like Borg or Kubernetes). The paper misses an opportunity to connect its work to this larger ecosystem. How might A4 integrate with a cluster scheduler that might change a workload's priority or QoS class on the fly?
Questions to Address In Rebuttal
-
Generalizability of "Directory Contention": Your discovery of directory contention is fascinating and hinges on the specific coupling of data ways and directory ways in the Skylake microarchitecture. Could you elaborate on whether you expect this specific contention to exist in newer Intel CPUs (e.g., Ice Lake, Sapphire Rapids) or in AMD's Zen architectures? Does the core principle—that I/O data touched by a core must have its coherence state tracked in a specific, limited directory structure that may alias with application data—hold more generally, even if the implementation details differ?
-
Practicality of Tuning A4: The A4 framework's effectiveness relies on several hand-tuned thresholds. In a real-world datacenter, how do you envision these parameters being set and maintained? Would this require expert human intervention for each new server generation or major software stack, or do you believe a simple, robust set of default values exists?
-
On the "Hidden Knob" for DCA Control: You mention using the
perfctrlsts_0register to selectively disable DCA for storage devices (Section 4.2, page 6). Could you clarify the nature of this control? Is this a documented, officially supported feature for this purpose, or is it an undocumented capability discovered through reverse engineering? The answer would help in assessing the long-term viability and robustness of this part of your solution.
Review 3
Review Form: The Innovator (Novelty Specialist)
Summary
The authors present a work that identifies two allegedly novel Last-Level Cache (LLC) contention sources in modern Intel server CPUs and proposes a runtime management framework, A4, to mitigate them. The first claimed novel insight (C1) is that DMA-written I/O data, upon being accessed by a CPU core, migrates from Direct Cache Access (DCA) ways to specific, hidden "inclusive ways" tied to the LLC's directory structure, causing a new form of contention with non-I/O workloads. The second insight (C2) is that modern high-bandwidth storage-I/O devices pollute the DCA ways, harming co-located network-I/O, while deriving little benefit themselves. The proposed solution, A4, orchestrates existing hardware features, most notably Intel's Cache Allocation Technology (CAT) and a little-known knob for per-device DCA disabling, to implement mitigation policies derived from these insights.
My evaluation is focused solely on whether these contributions represent a genuine advancement over the state of the art.
Strengths
The primary strength of this paper lies in the identification of what appears to be a genuinely new, microarchitecturally-specific contention mechanism.
-
Novel Insight into Directory Contention (C1): The core novel contribution is the discovery and characterization of "directory contention" as described in Section 3.1 (page 4). While prior art has extensively documented I/O-induced LLC contention (e.g., latent contention [67], DMA bloat [2]), the specific mechanism identified here is new. It builds upon the reverse-engineering work of Yan et al. [65] that exposed the non-inclusive LLC's directory structure, but it goes a step further by identifying the dynamic behavior of I/O cache lines migrating into these specific "inclusive ways". This is a non-obvious performance pathology and represents a true contribution to the community's understanding of modern cache hierarchies.
-
Novel Application of a Hardware Feature (F2): The use of a runtime-accessible knob to selectively disable DCA for specific I/O devices (Section 4.2, page 7) is a significant and novel engineering contribution. The prevailing understanding is that DCA is typically toggled system-wide via BIOS [22]. Exposing and utilizing a mechanism (reportedly via the
perfctrlsts_0 register) for fine-grained, per-port control at runtime is a new capability that enables a class of solutions previously considered infeasible. This is a valuable discovery.
-
Novel Policies Derived from Insights: The proposed mitigation strategies are not generic; they are tightly coupled to the novel findings. The "n-Overlap" allocation strategy (Section 4.1, page 6), which intentionally allocates I/O workloads to overlap with inclusive ways to maximize caching efficiency, is a clever and non-obvious policy that would not have been conceived without the insight from C1.
Weaknesses
While the work contains kernels of true novelty, several aspects are incremental advancements or applications of well-known concepts to a new problem context.
-
Incremental Nature of Storage-I/O Contention (C2): The contention between high-bandwidth storage and network I/O in DCA ways (Section 3.2, page 5) is a timely but arguably foreseeable problem. The concept of DCA pollution is not new [18]. The novelty here is in identifying a new, high-bandwidth aggressor (fast NVMe SSDs). Given the known performance characteristics of storage I/O (large blocks, poor temporal locality for streaming reads), its negative interaction with latency-sensitive network packets within a shared resource is an expected outcome. The contribution is one of characterization and quantification, which is valuable, but it lacks the fundamental surprise of the C1 finding.
-
Established Framework Design Pattern: The overall architecture of A4—a runtime daemon that monitors hardware performance counters and dynamically adjusts LLC partitions using CAT—is a well-established design pattern in the literature for performance management (e.g., [48], [66], [67]). The novelty of A4 is not in its structure, but in the specific rules and heuristics it implements, which are derived from the aforementioned insights. The paper should be careful not to overstate the novelty of the framework itself.
-
Conceptual Overlap in Mitigation Techniques: The concept of "pseudo LLC bypassing" (Section 5.5, page 8), where antagonistic workloads are allocated a minimal number of "trash ways," is functionally similar to prior work on utility-based partitioning [48] and dead block management [33, 37], which also seek to identify and isolate cache-unfriendly access streams to prevent them from polluting the cache for other, more deserving workloads. The authors' application of this concept to DMA-bloated I/O streams is sound, but the underlying principle is not new.
Questions to Address In Rebuttal
-
On Directory Contention (C1): The work of Yuan et al. [67] previously identified "latent contention" where non-I/O workloads contend with network-I/O in DCA ways. Please clarify precisely how your "directory contention" differs. Is it a completely separate mechanism, or an evolution of the same root cause now better explained by the non-inclusive directory structure?
-
On the Per-Device DCA Knob (F2): The ability to disable DCA per-port is a cornerstone of your solution to C2. Is this a formally documented and stable feature in Intel's architecture, or is it an undocumented Model-Specific Register (MSR) that could change or be removed in future silicon? The generality and future-proofing of this key mechanism depend on the answer.
-
On Pseudo LLC Bypassing: Please explicitly compare your "pseudo LLC bypassing" for antagonistic I/O workloads to prior academic and industry work on identifying and managing cache-unfriendly or "streaming" data, such as utility-based cache partitioning (UCP) or dead block prediction. What is the fundamental delta that makes your approach novel beyond its application to I/O-generated data?
HardHarvest: Hardware-Supported Core Harvesting for Microservices
Abstract
In microservice environments, users size their virtual machines (VMs) for peak loads, leaving cores idle much of the time. To improve core utilization and overall throughput, it is instructive to consider a recently-introduced software technique for ...
Reviews
Review 1
Title: HardHarvest: Hardware-Supported Core Harvesting for Microservices Reviewer: The Guardian
Summary
The authors propose HardHarvest, a hardware architecture designed to support core harvesting in microservice environments. The paper identifies two primary overheads in existing software-based approaches: the latency of hypervisor-driven core reassignment and the performance penalty from flushing/invalidating private caches and TLBs. To address this, HardHarvest introduces three main hardware features: 1) a hardware request scheduler with per-VM queues to accelerate core reassignment, 2) way-partitioning of private caches/TLBs into "Harvest" and "Non-Harvest" regions to preserve Primary VM state, and 3) a novel replacement algorithm that attempts to steer shared application data into the preserved "Non-Harvest" region. The evaluation, conducted via full-system simulation, claims that HardHarvest significantly reduces Primary VM tail latency by 6.0x and increases Harvest VM throughput by 1.8x compared to a state-of-the-art software baseline.
Strengths
- Well-Defined Problem: The paper correctly identifies a significant and timely problem. The overheads associated with software-based core harvesting, particularly in the context of latency-sensitive microservices, are substantial and well-articulated in Section 3. The motivational analysis is convincing in establishing the potential for a hardware-based solution.
- Comprehensive Design: The proposed solution is multifaceted, addressing both the control plane (core reassignment) and data plane (cache state) aspects of the problem. This shows a thorough understanding of the challenges involved.
- Detailed Breakdown of Gains: The cumulative breakdown of performance improvements in Figure 12 is a methodologically sound way to attribute gains to specific architectural features, which is commendable.
Weaknesses
My primary role is to ensure the rigor and validity of new claims against the established state-of-the-art. In this capacity, I have identified several critical weaknesses that challenge the paper's core conclusions.
-
Potentially Unfair Baseline Comparison: The central claim of a 6.0x tail latency reduction rests on the comparison against "state-of-the-art software core harvesting" (Harvest-Term), which is modeled after SmartHarvest [88]. However, a crucial feature of SmartHarvest for mitigating tail latency is its use of an "emergency buffer" of idle cores that can be reclaimed instantly without software overhead. The authors mention this feature in Section 4 (Page 4), but it is entirely unclear whether their simulated software baseline (Harvest-Term) actually implements this buffer. Without it, the baseline is significantly weakened, and the comparison is arguably one against a strawman.
-
Conflation of General-Purpose and Harvesting-Specific Gains: The results in Figure 11 show that HardHarvest not only improves upon software harvesting but also achieves significantly lower tail latency than the NoHarvest baseline. The authors attribute this to "improved cache/TLB replacement and request queuing" (Section 6.1, Page 11). This is a critical issue. It implies that a substantial portion of the claimed benefit is derived from a general-purpose hardware scheduler and a new cache policy, not from the harvesting mechanism itself. The paper is framed as a solution for harvesting, but the evidence suggests it is a paper about a new scheduler that also happens to do harvesting. This conflates two separate contributions and overstates the benefits attributable to the novel harvesting architecture.
-
The Fragility of the Cache Replacement Heuristic: The proposed replacement algorithm (Algorithm 1, Section 4.2.3) is entirely dependent on a heuristic to classify pages as "shared" or "private." The paper proposes a simple temporal heuristic: memory allocated before server.serve() is shared, and memory allocated after is private. This is a fragile assumption that is unlikely to hold for many real-world microservices that might perform lazy initialization, use dynamic data structures, or rely on just-in-time compilation. The paper presents no sensitivity analysis on the accuracy of this classifier. If pages are frequently misclassified, this "smart" replacement policy could easily degrade into a worst-case scenario, polluting the protected Non-Harvest region with transient data or evicting critical shared data. The entire benefit of cache partitioning rests on this unvalidated heuristic.
-
Understated Hardware Complexity and Cost: The paper proposes a non-trivial centralized hardware controller (Figure 9) with dedicated network links, multiple Queue Managers, and VM State Registers. The cost analysis in Section 6.8, based on McPAT, focuses almost exclusively on the storage overhead of the request queue and extra cache bits, concluding a mere 0.19% area overhead. This seems to grossly underestimate the true cost. The analysis likely neglects the area and power of the complex control logic for the scheduler, the cross-VM interrupt mechanism, the dynamic queue management, and the priority multiplexers for the new replacement policy. Furthermore, the centralized design presents a potential scalability bottleneck for future processors with hundreds of cores, a concern that is not addressed.
Questions to Address In Rebuttal
The authors must address the following points directly and with specific evidence from their experiments to convince me of the validity of their work.
- On the Baseline's Fidelity: Please confirm whether your simulated software baseline (Harvest-Term) implements the emergency core buffer from SmartHarvest [88]. If it does not, please justify why this is still a fair "state-of-the-art" comparison and how your conclusions would change if it were included.
- On Deconflating Contributions: Can you provide an evaluation that isolates the performance benefit derived purely from the harvesting-specific mechanisms (i.e., cache partitioning and the associated replacement policy) from the benefits of the general-purpose hardware request scheduler and queues? For example, by comparing NoHarvest against a NoHarvest+HardHarvestScheduler configuration.
- On the Replacement Heuristic's Robustness: What is the performance impact on a Primary VM if the shared/private page classification heuristic has a high error rate (e.g., 25% or 50% misclassification)? Please provide a sensitivity analysis or a more robust defense of why this simple heuristic is sufficient for a general-purpose hardware mechanism.
- On Hardware Cost Realism: Can you provide a more detailed breakdown of the hardware cost analysis that includes the control logic for the scheduler, queue managers, and replacement policy, not just the storage elements? Please also comment on the scalability of the centralized controller design to processors with significantly more than 36 cores.
Review 2
Paper: HardHarvest: Hardware-Supported Core Harvesting for Microservices Reviewer: The Synthesizer
Summary
This paper presents HardHarvest, a novel hardware architecture designed to enable efficient core harvesting in microservice environments. The authors identify a critical bottleneck in existing software-based harvesting techniques (e.g., SmartHarvest): the high overhead of reassigning cores between Virtual Machines (VMs), which involves hypervisor calls and extensive cache/TLB flushes. These overheads, while tolerable for monolithic applications, are prohibitive for latency-sensitive microservices that operate on sub-millisecond timescales.
HardHarvest's core contribution is a holistic, hardware-first solution that tackles these overheads directly. It proposes a two-pronged approach: 1) a hardware-based request scheduling system with per-VM queues that allows cores to be reassigned between Primary (latency-critical) and Harvest (batch) VMs without hypervisor intervention, and 2) a hardware-managed partitioning of private caches and TLBs into "Harvest" and "Non-Harvest" regions. This partitioning scheme preserves the Primary VM's "hot" state in the Non-Harvest region during harvesting, dramatically reducing the cold-start penalty upon core reclamation. The evaluation, performed via full-system simulation, demonstrates significant benefits, most notably reducing the tail latency of Primary VMs by 6.0x compared to software harvesting, to the point where it is even better than a non-harvesting baseline, while simultaneously increasing Harvest VM throughput by 1.8x.
Strengths
This work's primary strength lies in its excellent synthesis of a critical industry problem with a well-motivated and elegant architectural solution.
-
Timely and Significant Problem: The paper addresses a problem of immense practical and economic importance. The tension between provisioning for peak load (leading to low average utilization) and meeting strict tail-latency SLOs is a central challenge in modern cloud infrastructure. The authors correctly identify that existing software solutions for resource harvesting, developed in the context of longer-running applications, are a poor fit for the microsecond-scale world of microservices. This work is therefore not just an academic exercise but a direct response to a real-world architectural gap.
-
A Conceptual Leap in Core Harvesting: The move from software to hardware for this task is a significant conceptual advance. While prior work has proposed hardware support for microservice scheduling (e.g., µManycore [76], RPCValet [15]), HardHarvest is, to my knowledge, the first to propose a comprehensive hardware framework for resource elasticity at this granularity. It connects the dots between datacenter economics and microarchitecture in a compelling way.
-
Holistic and Co-Designed Solution: The true novelty is not in any single component but in the co-design of the two main features. The hardware scheduler solves the core reassignment latency, while the cache partitioning solves the resulting microarchitectural cold-start problem. This integrated approach shows a deep understanding of the full performance stack, from the hypervisor down to the cache replacement policy. The proposed replacement algorithm (Algorithm 1, page 9), which attempts to steer shared vs. private pages, is a particularly thoughtful refinement that demonstrates this depth.
-
Strong Quantitative Motivation: The "Motivation" section (Section 3, pages 3-4) is exemplary. By systematically measuring and presenting the overheads of hypervisor calls (Figure 4) and cache flushes (Figure 5), the authors build an undeniable case for a hardware-level intervention. This data-driven motivation makes the subsequent design feel necessary rather than contrived.
Weaknesses
The weaknesses of the paper are primarily related to the broader systems context and potential complexities that are not fully explored.
-
System Complexity and Interaction: The paper presents HardHarvest as a self-contained module. In a real processor, this logic would need to interact with a host of other complex features: hardware security enforcers, performance monitoring units, NUMA optimizations, and emerging protocols like CXL. For instance, how would the hardware scheduler's decisions interact with the OS/hypervisor's broader power management or process scheduling policies? The paper could be strengthened by acknowledging and briefly discussing these integration challenges.
-
Security Implications Beyond Flushing: The security model is predicated on flushing the "Harvest" region of the caches to prevent direct data leakage. While this is a necessary first step, the design—which intentionally preserves a Primary VM's state on a core while a Harvest VM executes—creates a novel scenario for side-channel analysis. Could rapid, hardware-managed switching create new timing channels related to resource contention in the core's backend or the control logic of the cache partitioning itself? The brief mention of adding a delay to prevent timing channels (page 8) feels insufficient for a system operating at this level of intimacy between tenants.
-
Robustness of Workload Assumptions: The efficacy of the advanced cache replacement policy hinges on the ability to reliably distinguish between "shared" (e.g., application code, read-only data) and "private" (request-specific) memory pages. The proposed heuristic (based on allocation time relative to a server.serve() call) is clever and likely works for the C++/gRPC-style services evaluated. However, its applicability to microservices built in other ecosystems (e.g., Java with a JIT, Go with its own runtime scheduler) is not obvious. The performance of the system could be sensitive to the accuracy of this classification, making this a potential point of fragility.
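To make the classification assumption concrete, the temporal heuristic amounts to tagging allocations by whether they happen before or after the service loop starts. A toy sketch is below; the names are hypothetical and this is not the HardHarvest mechanism, which operates on pages in hardware:

```c
#include <stdbool.h>
#include <stdlib.h>

/* Toy illustration of the "allocated before serve() => shared" heuristic. */
enum page_class { PAGE_SHARED, PAGE_PRIVATE };

static bool serving_started = false;   /* flipped when the RPC loop begins */

struct tagged_alloc {
    void            *ptr;
    enum page_class  cls;
};

/* Wrapper that records the class a real system would have to communicate to
 * the cache-partitioning hardware (e.g., via a page-table hint bit). */
struct tagged_alloc tagged_malloc(size_t bytes)
{
    struct tagged_alloc a;
    a.ptr = malloc(bytes);
    a.cls = serving_started ? PAGE_PRIVATE : PAGE_SHARED;  /* the temporal rule */
    return a;
}

/* Call at the point corresponding to server.serve(). */
void mark_serving_started(void) { serving_started = true; }
```

A JIT-compiled or garbage-collected runtime blurs exactly this before/after boundary, which is the fragility described above.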
Questions to Address In Rebuttal
-
On System Integration: Could the authors comment on the potential interactions between the HardHarvest controller and other modern SoC features, such as Intel's Thread Director or system-wide power management policies? Is there a risk of the HardHarvest hardware scheduler and the OS/hypervisor schedulers working at cross-purposes?
-
On Side-Channel Security: The paper's security model relies on flushing the Harvest region. Could the authors elaborate on potential timing channels or other side-channels that might arise from a Primary VM's state being preserved in the Non-Harvest region while a Harvest VM is executing on the same core? For example, could the Harvest VM infer information by observing the performance effects of the replacement policy (Algorithm 1) acting on the Primary VM's hidden state?
-
On the Generality of the Replacement Policy: The efficacy of the specialized replacement policy hinges on accurately identifying shared pages. How sensitive are the performance benefits to the accuracy of this heuristic? Have the authors considered how this heuristic would apply to microservices built with different frameworks (e.g., those using garbage collection or JIT compilation) where the distinction between shared and private allocations may be less clear?
Review 3
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper proposes HardHarvest, a hardware architecture designed to support core harvesting specifically for microservice environments. The authors' central claim is that this is the first hardware-based solution to this problem, addressing the prohibitive overheads of existing software-based approaches. The core of the proposed architecture consists of two main components: 1) a hardware-based request scheduler that manages core re-assignment between "Primary" (latency-critical) and "Harvest" (batch) VMs without hypervisor intervention, and 2) a hardware mechanism for partitioning private caches/TLBs, coupled with a novel replacement algorithm (Algorithm 1, page 9) that aims to preserve the Primary VM's "warm" state by prioritizing shared pages in a protected partition. The authors claim their solution significantly improves core utilization and Harvest VM throughput while reducing Primary VM P99 tail latency by 6.0x compared to state-of-the-art software harvesting.
Strengths
The primary strength of this paper lies in its novel synthesis of architectural concepts to solve a well-motivated problem. While individual components of the solution have roots in prior work, their combination and specific application to inter-VM core harvesting for microservices appears to be genuinely new.
-
Novel Problem-Architecture Mapping: The core idea of creating a dedicated hardware architecture for core harvesting is, to my knowledge, novel. Prior art has focused on software techniques for harvesting (e.g., SmartHarvest [88]) or hardware acceleration for RPC scheduling (e.g., µManycore [76], ALTOCUMULUS [96]), but not the explicit, hardware-managed, dynamic lending and reclamation of cores between different VM types.
-
Novel Cache/TLB Replacement Policy: The most distinct and novel technical contribution is the proposed cache/TLB replacement algorithm (Section 4.2.3). The idea of partitioning the cache is not new (e.g., Intel CAT), but the policy itself is. It introduces awareness of both the region type (Harvest vs. Non-Harvest) and the page type (shared vs. private) into the victim selection process. This fine-grained, policy-driven management to selectively preserve a Primary VM's working set across preemption is a clever microarchitectural technique that I have not seen proposed in prior work.
-
Significant Delta Over Prior Art: The proposed hardware scheduler extends prior work on RPC schedulers by adding logic specific to harvesting: it is aware of VM types (Primary/Harvest), manages core "loans," and handles the preemption and reclamation protocol (Section 4.1.5). This is a non-trivial delta that makes it fundamentally different from a simple request scheduler.
Weaknesses
My critique focuses on precisely delineating the novel contributions from the integration of existing ideas, and questioning the robustness of the assumptions upon which the novelty rests.
-
Conflation of Novelty with Integration: The paper presents HardHarvest as a monolithic novel architecture. However, it is more accurately an integration of several concepts, some novel and some drawn directly from prior work. For example, the fast context switching is attributed to µManycore [76], and the need for an efficient flush/invalidate mechanism is acknowledged as a known problem with existing solutions [30, 51]. The impressive 6.0x tail latency reduction is therefore a result of the entire system, and it is difficult to isolate the benefit derived purely from the novel components (the VM-aware scheduler and the replacement policy) versus the benefit from simply implementing previously-proposed hardware accelerations for context switching and cache management. The breakdown in Figure 12 (page 11) is cumulative, which makes it hard to assess the standalone value of the truly new ideas.
-
Novelty Reliant on a Heuristic: The effectiveness of the novel replacement policy (Algorithm 1) is entirely dependent on a software heuristic for classifying pages as "shared" or "private" (Section 4.2.2, page 8). The paper suggests a simple rule: data allocated before server.serve() is shared. While plausible for the frameworks studied, this is not a hardware novelty and may be fragile. The paper does not sufficiently explore the generalizability of this heuristic. If the heuristic fails, the core benefit of the novel replacement algorithm is nullified, even if the hardware is implemented perfectly.
-
Complexity vs. Benefit of the Novel Components: The proposed hardware is significant, requiring a centralized controller, dedicated network, and modifications to every core's cache/TLB replacement logic. While the overall performance gain is large, it is unclear how much of that gain is attributable to the new, complex parts of the design. A simpler design that implements hardware context switching and static cache partitioning (using existing techniques) might achieve a substantial portion of these gains. The paper lacks an ablation study to justify the added complexity of its novel scheduler and replacement policy over simpler hardware extensions.
Questions to Address In Rebuttal
-
The central claim is that HardHarvest is the "first" architecture for core harvesting in hardware. While it appears novel in the context of mainstream CPU architecture, can the authors elaborate on whether similar concepts of hardware-managed resource borrowing/lending have been explored in adjacent fields, such as reconfigurable computing (FPGAs) or specialized manycore processors, even if not explicitly termed "core harvesting"?
-
To better isolate the paper's novel contributions, can the authors provide a more direct "apples-to-apples" comparison? For instance, what would the performance be with a baseline that includes previously-proposed hardware context switching and efficient flushing, but uses a simpler hardware scheduler (not VM-aware) and standard cache partitioning (e.g., static ways via CAT) without the novel replacement policy? This would help quantify the specific benefit of the new ideas presented in this work.
-
The novel replacement policy's success is predicated on the shared/private page classification heuristic. How would the performance of HardHarvest degrade if this heuristic were, for instance, only 50% accurate? A sensitivity analysis on the accuracy of this software-level assumption would strengthen the claims about the hardware's utility.
MoPAC: Efficiently Mitigating Rowhammer with Probabilistic Activation Counting
Abstract
Rowhammer has worsened over the last decade. Existing in-DRAM solutions, such as TRR, were broken with simple patterns. In response, the recent DDR5 JEDEC standards modify the DRAM array to enable Per-Row Activation Counters (PRAC) for tracking aggressor ...
Reviews
Review 1
Paper Title: MOPAC: Efficiently Mitigating Rowhammer with Probabilistic Activation Counting Reviewer: The Guardian (Adversarial Skeptic)
Summary
This paper identifies the significant performance overhead (cited as 10% on average) of the JEDEC PRAC (Per-Row Activation Counter) mechanism, which is caused by the extended DRAM timings required for every activation's counter update. To mitigate this overhead, the authors propose MoPAC (Mitigating Rowhammer with Probabilistic Activation Counts), a scheme that performs these counter updates probabilistically rather than deterministically. By updating counters for only a small fraction of activations (with probability p), the performance penalty is incurred less frequently. To maintain security guarantees, MoPAC revises the Alert Threshold (ATH) downwards to a new value, ATH*, which is calculated to ensure a target Mean Time To Failure (MTTF) against attacks. The paper presents two implementations: MoPAC-C, a memory controller-side solution requiring two precharge command variants, and MoPAC-D, a fully in-DRAM solution using a small queue (SRQ) to buffer pending updates. The authors claim that for a Rowhammer threshold (TRH) of 500, MoPAC-C and MoPAC-D reduce the 10% PRAC slowdown to 1.7% and 0.7%, respectively.
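To make the sampled-update idea concrete, a toy Monte Carlo sketch of the escape probability is shown below; the probability, increment, threshold, and trial count are illustrative placeholders rather than the values the paper derives:

```c
#include <stdio.h>
#include <stdlib.h>

/* Toy model: each activation updates the row counter with probability p by
 * 1/p; an "escape" means the row receives TRH real activations while the
 * sampled counter never crosses the revised alert threshold ATH*. */
int main(void)
{
    const double p       = 1.0 / 16.0;   /* sampling probability        */
    const int    incr    = 16;           /* counter increment = 1/p     */
    const int    trh     = 500;          /* assumed Rowhammer threshold */
    const int    ath_rev = 300;          /* hypothetical revised ATH*   */
    const long   trials  = 1000000;

    srand(42);
    long escapes = 0;
    for (long t = 0; t < trials; t++) {
        int counter = 0;
        for (int act = 0; act < trh && counter < ath_rev; act++) {
            if ((double)rand() / RAND_MAX < p)
                counter += incr;          /* probabilistic counter update */
        }
        if (counter < ath_rev)            /* alert never fired            */
            escapes++;
    }
    printf("estimated escape probability: %.2e\n", (double)escapes / trials);
    return 0;
}
```

The paper's analytic derivation of ATH* plays the role of this simulation in closed form; the weaknesses below question the assumptions behind that derivation.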
Strengths
- Well-Motivated Problem: The paper correctly identifies a critical and practical barrier to the adoption of PRAC. The 10% average performance overhead demonstrated in Figure 2 is a significant cost for a security feature, and addressing this is a worthwhile endeavor.
- Novel Core Concept: The central idea of trading deterministic, high-cost checks for probabilistic, low-cost checks (compensated by a more sensitive threshold) is a clever approach to amortizing the performance penalty of PRAC.
- Comprehensive Proposal: The authors present two distinct and well-reasoned implementations (MoPAC-C and MoPAC-D) that explore different points in the design space (MC vs. DRAM-centric), demonstrating a thorough consideration of potential deployment paths.
Weaknesses
My primary responsibility is to ensure the rigor and correctness of the work. Despite its clever premise, the paper's security claims rest on a foundation with several critical, unaddressed flaws.
-
Fundamentally Unjustified Independence Assumption in Security Model: The entire security analysis hinges on the derivation of the acceptable row failure probability, ε. In Section 5.3 (page 7), for a double-sided Rowhammer attack, the paper claims the overall escape probability is Pe2 = Pe1 * Pe1, where Pe1 is the probability of a single aggressor escaping mitigation. This implies that the probabilistic events of failing to mitigate the first aggressor and failing to mitigate the second aggressor are statistically independent. This is a monumental and entirely unsubstantiated assumption (the step in question is written out after this list). System-level phenomena such as voltage noise, temperature fluctuations, or weaknesses in a hardware pseudo-random number generator could easily introduce correlation. A single period of low entropy from the RNG could cause a burst of missed updates for both aggressors simultaneously, completely invalidating the Pe1 = sqrt(F) derivation in Equation 5. Without a rigorous justification for this independence assumption, the entire security proof is unsound.
-
Inadequate Analysis of Performance Attacks: The analysis of Denial-of-Service attacks in Section 7 is superficial and relies on average-case behavior, which is insufficient for security analysis. The multi-bank attack model (Section 7.2) calculates slowdown based on an average α value of 0.55, derived from a Monte Carlo simulation. An adversary will not target the average case; they will seek to engineer the worst case. The paper provides no analysis of the distribution of α or its minimum possible value. An attacker could potentially find patterns that synchronize the probabilistic updates across banks to achieve a much lower α, leading to a far more effective DoS attack than the paper suggests. A security analysis must be based on worst-case, not average-case, scenarios.
-
Unaddressed Timing Side-Channel Vulnerability: The threat model in Section 2.1 is incomplete. MoPAC-C introduces two precharge commands, PRE and PREcu, with different latencies (as implied by Figure 6, where PRE uses tRP (norm) and PREcu uses tRP (PRAC)). This creates a classical timing side-channel. An attacker can precisely measure the precharge latency to determine whether a counter update occurred or not for a given activation. This leaks the outcome of the random number generator. A sophisticated attacker could use this information to adapt their activation patterns, waiting for a series of non-updates (confirmed via the side-channel) before launching a burst of activations, thereby defeating the probabilistic premise. The paper completely fails to acknowledge or analyze this vulnerability.
-
Discrepancy in Performance Claims and Understated Costs: The abstract and introduction prominently feature the impressive 0.7% slowdown for MoPAC-D at TRH=500. However, Appendix-A reveals that when protecting against Row-Press, the slowdown for MoPAC-D at TRH=500 balloons to 6.8%. Row-Press is a known, practical attack, and omitting its impact from the headline results is misleading. Furthermore, the paper neglects any discussion of the area, power, and, most importantly, verification complexity of the logic required for MoPAC-D's in-DRAM SRQ, its counters, and the associated scheduling/draining logic. Dismissing the cost by comparing SRAM bytes to TRR is insufficient.
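The independence step questioned in the first weakness can be written out explicitly; the notation follows the review's paraphrase of Section 5.3 rather than the paper's exact equations:

```latex
\begin{align}
  P_{e2} &= \Pr[\text{aggressor}_1 \text{ escapes} \;\wedge\; \text{aggressor}_2 \text{ escapes}] \\
         &= P_{e1}\cdot P_{e1} = P_{e1}^{2}
            \quad\text{(valid only if the two escape events are independent)} \\
  P_{e2} &\le F \;\Longrightarrow\; P_{e1} \le \sqrt{F}.
\end{align}
```

If the two events are positively correlated (shared RNG state, correlated noise), the second line underestimates P_{e2}, and a per-aggressor budget of sqrt(F) no longer guarantees the target failure rate.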
Questions to Address In Rebuttal
- Please provide a rigorous justification for the assumption in Section 5.3 that the probabilistic mitigation failures for the two aggressor rows in a double-sided attack are statistically independent events. What happens to your security guarantees if a weak RNG or systemic noise introduces even a small correlation?
- Your performance attack analysis in Section 7.2 relies on an average-case parameter α=0.55. Please provide a worst-case analysis. What is the minimum achievable α an attacker could force, and what is the corresponding DoS impact?
- Please address the timing side-channel inherent in MoPAC-C's design (differing latencies for PRE and PREcu). How do you prevent an attacker from measuring this latency to learn the outcome of the probabilistic choice and adapt their attack strategy accordingly?
- Please clarify the discrepancy between the 0.7% slowdown for MoPAC-D claimed in the abstract and the 6.8% slowdown reported in Appendix-A when defending against Row-Press. Given that Row-Press is a realistic threat, which of these numbers is the more accurate representation of the mechanism's real-world performance overhead?
Review 2
Reviewer Persona: The Synthesizer (Contextual Analyst)
Summary
This paper addresses a critical and timely problem in hardware security: the significant performance overhead of the JEDEC-proposed Per-Row Activation Counting (PRAC) mechanism for mitigating Rowhammer. While PRAC offers a principled and robust defense against even future, low-threshold attacks, its ~10% average performance penalty—incurred by extending DRAM timings on every activation—presents a major barrier to its widespread adoption.
The authors' core contribution is a simple yet profoundly effective idea: decouple the high cost of counter updates from every memory activation by performing them probabilistically. Termed MoPAC (Mitigating Rowhammer with Probabilistic Activation Counts), their approach only incurs the latency penalty for a small fraction of activations. To maintain security, the mechanism compensates for the sparse sampling by increasing the counter's increment value and carefully lowering the mitigation threshold (referred to as ATH*). The paper presents a rigorous security analysis to derive these safe parameters based on a target Mean Time to Failure (MTTF). Two practical implementation variants are proposed: MoPAC-C (a memory controller-side solution) and MoPAC-D (a fully in-DRAM solution), demonstrating a clear understanding of the hardware ecosystem. The results are compelling, showing that MoPAC reduces the 10% overhead of PRAC to less than 2% for a near-future Rowhammer threshold of 500.
Strengths
-
Addresses a Crucial, Real-World Bottleneck: The primary strength of this paper is its direct and effective targeting of what is arguably the single largest impediment to the adoption of principled in-DRAM Rowhammer defenses. The academic and industrial consensus is moving towards exhaustive tracking mechanisms like PRAC. This work doesn't fight that trend; instead, it makes the trend practical. By transforming PRAC from a theoretically sound but performant-prohibitive solution into a viable one, the work has the potential for significant real-world impact.
-
Elegant and Parsimonious Core Idea: The central concept of applying probabilistic sampling is elegant. It doesn't require a radical redesign of the DRAM array or a complex new tracking structure. Instead, it applies a well-understood technique from computer science (probabilistic counting/sampling) to a new domain with excellent results. This parsimony is a hallmark of strong systems research. The further refinement presented in Section 8 (Non-Uniform Probability) is a thoughtful extension that shows how the core idea can be optimized even more by observing real-world access patterns.
-
Connects Security Theory with System Performance: The authors do not simply propose a heuristic. The security analysis in Section 5.3 is the lynchpin of the paper's credibility. By grounding the choice of probability p and the revised threshold ATH* in a formal model based on acceptable failure rates (Bank-MTTF), they provide a framework for reasoning about the security-performance trade-off. This principled approach is essential for any security mechanism to be taken seriously.
-
Well-Positioned within the Literature: The paper does a good job of situating itself in the broader context of Rowhammer mitigation. It correctly identifies the failure of early ad-hoc solutions like TRR and acknowledges the high storage cost of other principled trackers (ProTRR, Mithril), thereby motivating the industry's shift to PRAC. Crucially, in Section 9.2, it provides a compelling comparison against other low-cost trackers like MINT and PrIDE, arguing that MoPAC's probabilistic counter updates are a more efficient use of limited mitigation time "borrowed" from the refresh budget. This contextualizes MoPAC not just as a PRAC-optimizer, but as a competitive design in the broader space of low-overhead solutions.
Weaknesses
My criticisms are less about fundamental flaws and more about the boundaries and practical implications of the proposed idea.
-
Potential for a New Performance Attack Vector: While the authors analyze performance attacks in Section 7, the analysis is based on a simplified model. Introducing randomness creates a new surface for an adversary to potentially exploit. An attacker could craft access patterns to try and force worst-case scenarios for the probabilistic mechanism, leading to more frequent ABO stalls than with a deterministic system. The paper concludes the impact is less severe than other contention attacks, which is a fair point, but the nature of this new probabilistic vulnerability could be explored more deeply.
-
Understates the Challenge of Standardization: The paper proposes modifications that, while conceptually simple, would require changes to the JEDEC standard. MoPAC-C requires two types of precharge commands, and MoPAC-D requires new state and logic within the DRAM chip itself. The process of amending memory standards is notoriously complex and slow, involving consensus from multiple competing vendors. The paper could benefit from a more explicit discussion of the path to standardization and the potential pushback or implementation challenges from SoC and DRAM vendors.
-
The Fate of MoPAC is Tied to PRAC: The proposal's relevance is entirely contingent on the industry's continued commitment to the PRAC framework. If a different, unforeseen mitigation strategy emerges that supplants PRAC (e.g., a breakthrough in DRAM cell technology that is inherently Rowhammer-resistant, or a radically different low-cost tracker), then this work, while clever, would be solving a problem that no longer exists. This is not a flaw in the work itself, but a contextual risk regarding its long-term impact.
Questions to Address In Rebuttal
-
Regarding the performance attack analysis (Section 7), the throughput loss model
7/(N+7) is a useful first-order approximation. Could the authors elaborate on how this model might change in the face of more complex, multi-bank attack patterns that are specifically designed to exploit the probabilistic nature of MoPAC? How confident are the authors that this simplified model captures the true worst-case performance degradation?
-
Could you elaborate on the practical path to adoption for MoPAC-C and MoPAC-D? What specific changes would be needed in the JEDEC DDR5/DDR6 specifications? For MoPAC-D, what is the estimated silicon area/complexity cost for the per-bank SRQ and associated logic, and how might this affect DRAM vendor cost models?
-
The core philosophy of this paper is to trade deterministic, high-cost security operations for probabilistic, low-cost ones while maintaining a quantifiable security guarantee. Do you see this philosophy being applicable to other areas of hardware security beyond Rowhammer? For example, could similar probabilistic techniques be used to reduce the overhead of memory encryption/integrity counters or other security monitoring features?
Review 3
Review Form: Innovator (Novelty Specialist)
Summary
The paper identifies the significant performance overhead (~10%) of the emerging JEDEC PRAC (Per-Row Activation Counter) standard for Rowhammer mitigation as a major barrier to its adoption. The authors propose MoPAC (Mitigating Rowhammer with Probabilistic Activation Counting), a mechanism to drastically reduce this overhead. The core idea is to perform the costly PRAC counter updates not on every activation, but probabilistically with a small probability p. By doing so, the performance penalty is incurred only for a small fraction of activations. To maintain security guarantees, the proposal includes a rigorous statistical analysis to derive a new, lower Alert Threshold (ATH*) that accounts for potential undercounting due to sampling. The authors present two implementations: a memory-controller-side version (MoPAC-C) and a fully in-DRAM version (MoPAC-D).
Strengths
The primary strength of this work lies in its core conceptual contribution: the application of probabilistic counting to the PRAC framework. While PRAC itself is an industry-led evolution of per-row counting concepts (Intel Patent [9], Panopticon [2]), its high, deterministic overhead has been a known, unsolved problem. This paper presents a clean, elegant, and seemingly novel solution to that specific problem.
-
Novel Problem-Solution Pairing: The application of probabilistic methods to reduce the overhead of an exhaustive, per-row tracking mechanism appears to be new. Prior art has explored sampling for limited-entry trackers (e.g., MINT [32] uses deterministic sampling to populate a small tracker) or randomization of the physical space (e.g., RRS [35] randomizes row mappings). However, MoPAC's approach of managing the update frequency of all counters in a full PRAC system is a distinct and novel contribution. It directly targets the latency of the
read-modify-write operation inherent to PRAC, which previous academic work on trackers did not need to address.
-
Novel Security Framework for a Probabilistic Defense: A probabilistic mechanism is insecure without a model to bound the failure rate. The security analysis presented in Section 5.3 (page 7) is a crucial and novel component of this work. Deriving the acceptable failure probability (
ε) from a target Bank-MTTF and then using the binomial distribution to calculate the required number of critical updates (C) and the revised ATH* provides the theoretical foundation that makes the probabilistic approach viable and trustworthy. This moves the idea from a simple heuristic to a principled defense (one possible instantiation of this bound is sketched after this list).
-
Exploration of the Design Space: The proposal of two distinct implementations (MoPAC-C and MoPAC-D) demonstrates a thoughtful exploration of how this novel concept could be integrated into the existing system stack. This is not merely an algorithmic proposal but a tangible architectural one, considering the practical trade-offs between modifying the memory controller and adding complexity within the DRAM chip.
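To make the flavor of that argument concrete, one plausible way the bound could be written is sketched below, using T_RH for the true Rowhammer threshold, assuming an increment of 1/p per sampled update, and taking ε as given from the MTTF target; this illustrates the style of analysis rather than the paper's exact formulation.

```latex
% Sampled updates among T_RH activations: X ~ Bin(T_RH, p); counter value = X / p.
% A row escapes mitigation only if X < p * ATH*, so security requires
\Pr\bigl[\mathrm{Bin}(T_{RH},\,p) < p \cdot ATH^{*}\bigr] \;\le\; \varepsilon ,
% and the lowered threshold is the largest value satisfying that bound:
ATH^{*} \;=\; \max\Bigl\{\, C/p \;:\; \Pr\bigl[\mathrm{Bin}(T_{RH},\,p) < C\bigr] \le \varepsilon \,\Bigr\}.
```

Because the relative spread of Bin(T_RH, p)/p grows as p shrinks, ATH* must sit further below T_RH as sampling becomes sparser, which is exactly the security-performance trade-off this analysis governs.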
Weaknesses
My critique is centered on the precise boundaries of the novelty, as the underlying components are not new in isolation.
-
Incrementalism of NUP: The core intellectual contribution is the introduction of uniform probability to PRAC updates. The extension to Non-Uniform Probability (NUP) in Section 8 (page 12), while clever and effective at further reducing overhead, feels like a logical optimization rather than a separate novel concept. The fundamental leap is made in proposing MoPAC; NUP is an incremental refinement on top of it.
-
Derivative Implementation Primitives: The two proposed implementations, MoPAC-C and MoPAC-D, are logical consequences of the central probabilistic concept. MoPAC-C's requirement for two precharge commands (Section 5.1, page 6) is a simple extension of existing JEDEC capabilities (e.g., PRE vs. PRES). MoPAC-D's in-DRAM queue (SRQ) is a standard architectural pattern. While these are the first proposed implementations for this specific purpose, the building blocks themselves are not novel. The novelty resides entirely in the reason for their use, not the structures themselves.
Questions to Address In Rebuttal
-
Distinction from MINT [32]: The paper cites MINT, which uses deterministic sampling (
1/p) to select which row activations populate a small, centralized tracker. MoPAC uses probabilistic sampling (p) to decide when to update an existing, dedicated per-row counter. Could the authors explicitly articulate why their probabilistic approach is a conceptually distinct and more powerful primitive than the sampling mechanism used in MINT, particularly in the context of the PRAC framework, which already provides exhaustive tracking infrastructure?
-
Prior Art in Probabilistic Hardware Counting: Probabilistic counting is a foundational computer science concept (e.g., Morris's algorithm). While its application to PRAC appears novel, has the literature search included a thorough review of patents and technical reports related to probabilistic or approximate event counting for performance monitoring units (PMUs) or other hardware counters in memory systems? I am interested in whether a functionally identical mechanism has been proposed for a different purpose (e.g., managing power or wear-leveling).
-
Robustness of the Probabilistic Assumption: The security analysis in Section 5.3 (page 7) hinges on the assumption that the selection of an activation for a counter update is a truly independent random event. In a physical implementation, this would rely on a pseudo-random number generator (PRNG). Could a sophisticated attacker, through carefully crafted memory access timing, potentially discover the state of the PRNG or create access patterns that correlate with its output? Such an attack could break the independence assumption and allow for more undercounting than the binomial model predicts. What is the assumed complexity or quality of the PRNG required to uphold the security claims?
PuDHammer: Experimental Analysis of Read Disturbance Effects of Processing-using-DRAM in Real DRAM Chips
Abstract
Processing-using-DRAM (PuD) is a promising paradigm for alleviating the data movement bottleneck using a DRAM array’s massive internal parallelism and bandwidth to execute very wide data-parallel operations. Performing a PuD operation involves activating...
Reviews
Review 1
Paper Title: PuDHammer: Experimental Analysis of Read Disturbance Effects of Processing-using-DRAM in Real DRAM Chips
Reviewer: The Guardian
Summary
The authors present an experimental characterization of read disturbance vulnerabilities arising from multiple-row activation patterns, which are foundational to many Processing-using-DRAM (PuD) operations. They term this new vulnerability class "PuDHammer" and characterize two variants: Consecutive Multiple-Row Activation (CoMRA) and Simultaneous Multiple-Row Activation (SiMRA). Using a testbed of 316 commercial DDR4 chips, they claim that these PuDHammer patterns significantly reduce the hammer count required to induce a bitflip (HC_first) compared to conventional RowHammer. Furthermore, they claim PuDHammer bypasses an in-DRAM TRR mitigation mechanism. Finally, they adapt the industry-standard PRAC mitigation and evaluate its performance overhead, concluding it to be substantial.
Strengths
- Extensive Experimental Scale: The characterization is performed on a large and diverse set of 316 real DDR4 chips from four major manufacturers. This scale lends statistical weight to the general findings regarding CoMRA.
- Comprehensive Parameter Sweep: The authors have diligently characterized the effects of various operational parameters, including temperature, data patterns, and timing violations. This provides a valuable dataset for the community.
- Important Core Observation: The fundamental finding that multiple-row activation patterns, particularly CoMRA, exacerbate read disturbance is well-demonstrated across all tested vendors and represents a crucial consideration for the future security and reliability of PuD-enabled systems.
Weaknesses
My primary concerns with this manuscript center on the generalizability of its most dramatic claims, a lack of deep mechanistic explanation, and the realism of the mitigation evaluation.
-
Gross Overstatement and Lack of Generalizability for SiMRA: The paper's most striking claim—a 158.58x reduction in HC_first (Abstract, pg. 1)—is derived from SiMRA. However, the authors concede that SiMRA-induced bitflips were only observed in SK Hynix chips (Section 5.3, pg. 8). To present a vendor-specific behavior as a headline result for a phenomenon named "PuDHammer" is misleading. The abstract and introduction frame the results as general, but the reality is that the most potent attack vector demonstrated is highly limited in its applicability. The footnote explanation hypothesizing that other chips "ignore" commands is conjecture and insufficient justification. This is a critical flaw that undermines the paper's headline contributions.
-
Unsubstantiated Claim of TRR Bypass: The authors claim that PuDHammer "bypasses an in-DRAM RowHammer mitigation mechanism called Target Row Refresh" (Abstract, pg. 1). However, the corresponding experiment in Section 7 (pg. 11) was conducted on a single SK Hynix DRAM module. A sample size of N=1 is scientifically indefensible for making such a strong and general claim. This result may be an artifact of that specific module's TRR implementation and cannot be generalized to TRR mechanisms as a whole. The claim is severely overstated.
-
Superficial Physical-Level Analysis: The paper is strong on what happens but critically weak on why. For instance, Observation 14 (pg. 9) notes that SiMRA and RowHammer have opposite bitflip directions. The authors "hypothesize that double-sided SiMRA has a different underlying silicon-level mechanism" but provide no further evidence, simulation, or rigorous analysis. This is a significant finding that demands a more thorough investigation than mere speculation. Without a plausible physical model, the observations, while interesting, lack the explanatory power expected of a top-tier publication.
-
Questionable Realism in Mitigation Evaluation: The evaluation of the adapted PRAC mechanism (Section 8.2, pg. 12-13) relies on a synthetic workload injected into a simulation. This workload consists of "back-to-back one SiMRA with 32-row activation and one CoMRA operation every N ns." It is not clear that this pattern is representative of any real-world PuD application. The reported performance overheads are therefore entirely coupled to an artificial workload whose relevance is not justified. The conclusion that PRAC incurs high overheads is only valid for this specific, potentially unrealistic, workload.
-
Conflation of Distinct Phenomena: The paper bundles CoMRA and SiMRA under the single moniker "PuDHammer." However, the data suggests they may be mechanistically distinct: CoMRA works across vendors while SiMRA appears specific to one; they exhibit different sensitivities to parameters; and SiMRA demonstrates an opposite bitflip directionality. This lumping may obscure important underlying physics and misleads the reader into viewing them as two sides of the same coin, when they may be fundamentally different vulnerabilities.
Questions to Address In Rebuttal
The authors must address the following points directly and with evidence:
- SiMRA Generalizability: Please justify presenting the SiMRA results, particularly the 158.58x HC_first reduction, as a primary finding of the paper when it was only observed in a single manufacturer's chips. The paper should be reframed to clearly demarcate general phenomena (like CoMRA) from vendor-specific ones (like SiMRA), starting from the abstract.
- TRR Bypass Evidence: How can the authors justify their strong and general claim about bypassing TRR based on an experiment on a single DRAM module? Please provide data on more modules from different vendors or significantly walk back this claim to reflect the extremely limited scope of the experiment.
- Physical Mechanism: The paper offers a hypothesis for the opposite bitflip directionality of SiMRA (Observation 14, pg. 9). Can the authors provide any supporting evidence from device-level modeling, literature on charge transport, or other analysis to elevate this from conjecture to a scientifically plausible explanation?
- Workload Realism: Please provide a clear justification for the choice of the synthetic PuD workload used in the PRAC evaluation (Section 8.2, pg. 13). How does this access pattern map to the operational behavior of known PuD applications (e.g., in-memory databases, graph processing)? Without this, the performance results are of questionable value.
Review 2
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents the first systematic, experimental characterization of a new and severe form of DRAM read disturbance, which the authors aptly name "PuDHammer." The core contribution is the discovery that multiple-row activation—the fundamental mechanism enabling many Processing-using-DRAM (PuD) operations like in-DRAM copy and bitwise logic—dramatically exacerbates DRAM's vulnerability to read disturbance, far more so than conventional RowHammer attacks.
Through an extensive study on 316 commercial DDR4 chips, the authors demonstrate that PuDHammer can induce bitflips with up to a 158.58x lower activation count (HC_first) compared to the most effective RowHammer patterns. They show this vulnerability is sensitive to various operational parameters, can bypass existing in-DRAM mitigations like TRR, and that adapting the current industry standard mitigation (PRAC) results in prohibitively high performance overheads. This work effectively establishes a critical new bridge between the fields of Processing-in-Memory (PIM) and hardware security/reliability, revealing a potentially fundamental challenge to the safe and secure deployment of PuD systems.
Strengths
-
Fundamental and Timely Contribution: This paper sits at the crucial intersection of two highly active research areas. The PIM community has been largely focused on demonstrating functionality and performance, often with the implicit assumption of a reliable substrate. The hardware security community has studied RowHammer extensively but has not considered the unique access patterns of PIM. This work is the first to connect these domains, identifying a problem that is not merely incremental but represents a step-function increase in vulnerability severity. The discovery is foundational for any future work on secure PIM.
-
Impressive Experimental Rigor: The scale and thoroughness of the experimental study are a major strength. Characterizing the effect across 316 chips from four major manufacturers (as detailed in Table 1, page 4) provides strong evidence that PuDHammer is a universal phenomenon in modern DRAM, not an artifact of a specific device. The systematic exploration of a wide parameter space (temperature, data patterns, timing, spatial location) gives the results significant depth and credibility.
-
Clear Real-World Implications: By demonstrating that PuDHammer bypasses an existing mitigation (TRR, Section 7) and that adapting the industry standard PRAC incurs crippling overheads (48.26% average, Section 8.2), the paper immediately grounds its findings in practical reality. This is not a theoretical vulnerability; it is a clear and present danger to the security and isolation properties of future systems that might deploy PuD. This analysis effectively issues a challenge to the community: existing solutions are not sufficient.
-
Opening New Research Avenues: This is precisely the kind of paper that seeds an entire subfield. It raises more questions than it answers, which is a hallmark of significant research. Future work can now explore the underlying device physics, design PIM-aware mitigation strategies, develop new attack methodologies based on PuDHammer, and create secure compilers or runtimes for PIM systems. The paper provides a solid empirical foundation upon which this future work can be built.
Weaknesses
While the paper is strong, its focus as a "first characterization" study leaves some areas less developed, which is understandable but worth noting.
-
Limited Exploration of Root Cause: The paper does an excellent job characterizing what happens but is naturally speculative about why PuDHammer is so much more effective than RowHammer. The authors hypothesize about enhanced "trap-assisted electron migration" (Observation 2, page 6) but, as a systems paper, do not provide device-level analysis. While they rightly call for future work here, the paper would be even more impactful with a more detailed theoretical model or preliminary simulation to support their hypothesis.
-
Mitigation Proposals are Preliminary: The analysis of PRAC is excellent for showing the scale of the problem. However, the proposed new countermeasures in Section 8.1 are largely conceptual sketches. This is not a significant flaw, as a full mitigation design is likely beyond the scope of a single characterization paper. Still, the transition from problem identification to solution feels more like a pointer to future work than a complete contribution in its own right.
Questions to Address In Rebuttal
-
On the Physics of PuDHammer: The 158.58x reduction in
HC_first is a staggering figure. Beyond the general hypothesis of enhanced electron migration, could you elaborate on the potential physical mechanisms? For example, with simultaneous multiple-row activation (SiMRA), is it possible that the massive, concurrent current draw on the power delivery network is causing voltage droops that are the primary driver of vulnerability, as opposed to purely cell-to-cell coupling effects? How does this differ from the mechanism in consecutive activation (CoMRA)?
-
On the Future of PIM Architecture: Your findings paint a challenging picture for the PIM paradigm. Do you view PuDHammer as a fundamental roadblock that might render certain PuD techniques (especially those based on SiMRA) impractical for secure systems, or do you see it as a new set of engineering constraints that can be designed around? In other words, is this an architectural "deal-breaker" or a "call to arms" for more robust PIM designs?
-
On PIM-Aware Mitigations: Given that a generic, low-level mitigation like PRAC imposes such high overheads, what is your vision for a more intelligent, PIM-aware mitigation? For instance, could the memory controller be aware of PuD "sessions" and apply targeted, aggressive refreshing only to neighboring regions of active PuD subarrays, rather than treating every activation as a potential hammer? Could such a semantic approach significantly reduce the observed performance penalty?
Review 3
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present an experimental characterization study on the read disturbance effects that arise from multiple-row activation patterns, a core operational primitive of many Processing-using-DRAM (PuD) techniques. The paper identifies and names this phenomenon "PuDHammer," demonstrating that PuD access patterns, specifically consecutive (CoMRA) and simultaneous (SiMRA) multiple-row activations, can induce bitflips far more effectively than traditional single-aggressor RowHammer attacks. The core of the work is an extensive empirical analysis across 316 real DDR4 chips, exploring the impact of various parameters on this vulnerability. The authors conclude by evaluating the effectiveness of adapting an existing industry mitigation (PRAC) for this new threat vector.
Strengths
The primary strength of this paper lies in its novel claim: the identification and first-ever systematic characterization of a new class of read disturbance vulnerability. While the constituent concepts are not new—Processing-using-DRAM and RowHammer are both well-established fields—their interaction has been, until now, an unexplored area. My analysis confirms the authors' claim that prior work has not investigated the security and reliability implications of the multiple-row activation patterns inherent to PuD.
The novelty can be broken down as follows:
-
Identification of a New Phenomenon: The core contribution is the empirical proof that the access patterns required for PuD operations create a potent read disturbance vector. Prior PuD literature (e.g., Ambit [29], RowClone [40], ComputeDRAM [73]) focused on demonstrating the functionality and performance benefits of multiple-row activation, overlooking the potential for malicious exploitation or reliability degradation. This paper bridges that critical gap. The "delta" over prior art is therefore the entire investigation itself; it moves from "PuD can be done" to "here are the unintended, severe consequences of doing PuD."
-
Novel Experimental Insights: The paper provides a wealth of new data. For example, Observation 14 (Page 9), which finds that SiMRA and RowHammer have opposite dominant bitflip directions, is a genuinely new and surprising finding. This suggests a fundamentally different underlying physical mechanism, a significant contribution to the community's understanding of DRAM reliability that did not exist previously. The discovery that PuDHammer patterns can reduce the hammer count to the first flip (HC_first) by up to 158.58x (Abstract, Page 1) is not a marginal improvement; it represents a phase change in the severity of read disturbance threats.
-
Systematic Categorization: The distinction between CoMRA and SiMRA as separate classes of PuDHammer, each with unique characteristics, is a novel and useful categorization that will inform future work in this area.
Weaknesses
While the central discovery is novel, some of the surrounding elements rely heavily on established methods and concepts, which dilutes the overall novelty of the work.
-
Methodology: The experimental infrastructure (DRAM Bender [75] built on SoftMC [197]) and characterization methodology are standard practice, directly following the path laid by numerous prior RowHammer analysis papers (e.g., [125, 144, 145]). The novelty is in the access patterns tested, not in the methodology used to test them.
-
Countermeasures: The novelty in the countermeasures section (Section 8, Page 12) is limited.
- The three proposals in Section 8.1 (separating arrays, refreshing post-operation, weighted counting) are logical, high-level extensions of existing security principles like isolation and proactive defense. They lack a novel underlying mechanism and are presented conceptually without implementation.
- The adaptation and evaluation of Per-Row Activation Counting (PRAC) is an application of an existing, industry-standardized technique to a new problem. While the performance analysis provides new and valuable data points, the core mechanism is not new. The "weighted counting" optimization is a straightforward tuning of this existing counter-based approach rather than a fundamentally new idea. The novelty here is in the application and evaluation, not the invention (a minimal sketch of what such weighted counting could look like follows this list).
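To illustrate what the review is referring to, a minimal sketch of one way weighted counting could sit on top of a PRAC-style per-row counter is shown below; treating the weight as proportional to the observed HC_first reduction is an assumption made here for illustration, not the authors' specification.

```python
def weighted_prac_update(counters, activated_rows, weight=1.0, ath=1024):
    """Illustrative weighted PRAC-style update (not the paper's exact scheme).

    A normal single-row activation uses weight = 1.0. For a PuD-style
    multiple-row activation, a larger weight (e.g., scaled by how much the
    pattern lowers HC_first) makes the counter reach the alert threshold ATH
    after proportionally fewer activations.
    Returns the set of rows whose neighbors should be mitigated now.
    """
    to_mitigate = set()
    for row in activated_rows:
        counters[row] = counters.get(row, 0) + weight
        if counters[row] >= ath:
            to_mitigate.add(row)
            counters[row] = 0  # reset after the victim refresh is issued
    return to_mitigate
```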
Questions to Address In Rebuttal
-
The core novelty of this work is the experimental characterization of a previously unexplored phenomenon. However, the proposed countermeasures in Section 8.1 appear to be high-level, conceptual extensions of existing principles (e.g., isolation, proactive refresh). Can the authors elaborate on what is fundamentally novel about these proposed countermeasures beyond applying known concepts to this new problem context?
-
The evaluation of an adapted Per-Row Activation Counting (PRAC) mechanism in Section 8.2 is a valuable contribution. However, the mechanism itself is an existing industry standard. Is the novelty here purely in the evaluation, or is there a non-obvious, novel aspect to the adaptation of PRAC to handle multiple simultaneous activations that the paper does not fully elaborate on? The "weighted counting" optimization, for instance, seems like a logical tuning of an existing counter-based approach.
-
Prior works such as ComputeDRAM [73] and others [78, 79] have successfully demonstrated multiple-row activation on COTS DRAM. While their stated goal was not to study read disturbance, did they observe any anomalous behavior or reliability issues that might have hinted at the PuDHammer effect, even if not characterized as such? A more direct positioning against observations from this functionally-closest prior art would strengthen the claim of being the "first."
DREAM: Enabling Low-Overhead Rowhammer Mitigation via Directed Refresh Management
Abstract
This paper focuses on Memory-Controller (MC) side Rowhammer mitigation. MC-side mitigation consists of two parts: First, a tracker to identify the aggressor rows. Second, a command to let the MC inform the DRAM chip to perform victim-refresh for the ...
Reviews
Review 1
Paper Title: DREAM: Enabling Low-Overhead Rowhammer Mitigation via Directed Refresh Management
Reviewer: The Guardian
Summary
The paper addresses the performance and storage overheads of Memory Controller (MC) side Rowhammer mitigations when using the JEDEC Directed Refresh Management (DRFM) standard, as opposed to the hypothetical Nearby Row Refresh (NRR) command assumed by prior work. The authors propose DREAM, a DRFM-aware mitigation framework with two variants. The first, DREAM-R, aims to reduce the performance overhead of randomized trackers (e.g., PARA, MINT) by delaying DRFM commands to increase Rowhammer-mitigation Level Parallelism (RLP). The second, DREAM-C, aims to reduce the storage overhead of counter-based trackers by grouping rows into "gangs" that share a single counter, leveraging the concurrent mitigation capability of DRFMab.
Strengths
- Practical Problem Formulation: The paper correctly identifies a critical and practical gap in the literature: the significant performance penalty of using the real-world DRFM standard compared to the idealized NRR. Grounding the research in the constraints of an existing JEDEC standard is a notable strength.
- Clear Core Concepts: The two central ideas—delaying DRFM to improve RLP in DREAM-R and group-tracking to reduce storage in DREAM-C—are intuitive and directly leverage the specified capabilities of the DRFM interface. The graphical summary in Figure 1 provides a clear overview of the proposed mechanisms.
- Consideration of Secondary Effects: The authors demonstrate foresight by considering the implications of DRFM rate limits (Section 6), a real-world constraint that is often overlooked. While the proposed solution has its own issues, acknowledging the problem is commendable.
Weaknesses
My analysis reveals several significant weaknesses that challenge the security claims and practical viability of the proposed designs.
-
The Security of DREAM-R is Fundamentally Undermined by the DRFM Delay: The core premise of DREAM-R is to delay mitigation. This introduces a vulnerability window between when a row is identified as an aggressor (sampled into DAR) and when it is actually mitigated (DRFM is issued). An attacker can continue to issue activations to this row during this delay.
- The proposed patch, Active Target-Row Monitoring (ATM), as described in Section 4.4, is an admission of this fundamental flaw. It adds complexity (per-bank registers, counters, and comparators) to solve a self-inflicted problem. The paper does not adequately analyze the overheads and complexity of ATM itself, merely stating it requires "only 3 bytes per bank." This additional hardware is not negligible and complicates the design.
- The security analysis for PARA in Appendix A relies on a probabilistic model (Gamma distribution). This is insufficient. An adversary is not a random process; they will execute a deterministic, worst-case attack pattern to maximize activations within this delay window. The analysis fails to address such a targeted attack.
-
DREAM-C's Randomized Grouping is Susceptible to Pathological Conflicts: The paper correctly identifies that a naive set-associative grouping leads to hot counters (Figure 15, top). The proposed solution is randomized grouping. However, this is not a panacea.
- The security of this scheme relies on the unpredictability and quality of the "random masks" initialized at boot time. The paper does not discuss the source of this randomness or its resilience against side-channel analysis or reverse engineering. An attacker with knowledge of the XOR-masking scheme could potentially identify row addresses that collide into the same DCT entry, constructing a pathological workload to create a "hot gang" and trigger frequent, high-overhead DRFMab commands, leading to a Denial-of-Service attack. The paper's claim in Section 6.3 that the likelihood of finding such rows is "less than 1 in a trillion" is an assertion without sufficient proof, especially against a knowledgeable adversary.
- The results in Figure 15 (bottom) show that for workloads like
parest and lbm, the slowdown remains over 10% and 24%, respectively, at TRH=500 and TRH=250. This directly contradicts the claim in the abstract of "incurring negligible slowdown." These are not edge cases; they represent significant performance degradation under certain access patterns that the randomization failed to adequately smooth out.
-
The Proposed Handling of DRFM Rate-Limits Introduces a Security Hole: To comply with DRFM rate limits, the authors propose a Recent-Mitigated-Address-Queue (RMAQ) that explicitly skips mitigations for rows that have been recently mitigated (Section 6.1, Figure 18).
- This is a clear and exploitable security vulnerability. An attacker can hammer a row just enough to get it selected for mitigation and placed in the RMAQ. For the subsequent window (up to two tREFI periods), the attacker can hammer this same row with impunity, knowing that the defense will deliberately ignore it.
- The analysis in Section 6.2 claims this only has a "minor impact" on the tolerated TRH. This downplays the severity. It changes the security guarantee from "a row will be mitigated" to "a row might be mitigated, unless we just did so." This is a qualitative weakening of the defense that is not acceptable for a security mechanism.
-
Inconsistent Claims and Comparisons:
- As noted, the claim of "negligible slowdown" for DREAM-C is overstated, with the paper's own data showing slowdowns well above 10% for specific workloads.
- The comparison with ABACUS in Section 5.8 characterizes it as being equivalent to DREAM-C's set-associative design. This is a partial comparison. ABACUS's Sibling Activation Vector (SAV) is a specific mechanism to handle the common case of streaming accesses across banks efficiently. DREAM-C's randomized grouping is a different approach to a similar problem. The comparison does not engage with the technical novelty of the SAV mechanism, potentially presenting an uncharitable view of the competing work.
Questions to Address In Rebuttal
- Regarding DREAM-R: Please provide a security analysis against a deterministic, worst-case adversary who is aware of the DRFM delay and the ATM mechanism. The current probabilistic model is insufficient. How can you guarantee security when the fundamental design choice is to delay mitigation?
- Regarding DREAM-C: The security of randomized grouping hinges on the secrecy and strength of the per-bank random masks. What is the assumed threat model regarding the attacker's knowledge of this grouping function? Please provide a more rigorous argument for the resilience of this scheme against an attacker attempting to find collisions and construct a pathological access pattern.
- Regarding Performance Claims: How do you reconcile the claim of "negligible slowdown" for DREAM-C with your own results in Figure 15, which show slowdowns of 13% for
parest (TRH=500) and 24.7% for lbm (TRH=250)? Please define "negligible" in this context.
- Regarding Rate-Limit Handling: The RMAQ mechanism involves deliberately skipping required mitigations. Please justify how this can be considered a secure design choice. Can you provide an analysis of an attack that specifically exploits the RMAQ by forcing an entry and then hammering the now-unprotected row?
Review 2
Paper Title: DREAM: Enabling Low-Overhead Rowhammer Mitigation via Directed Refresh Management
Reviewer Persona: The Synthesizer (Contextual Analyst)
Summary
This paper addresses a critical and timely problem in the domain of hardware-based Rowhammer mitigation: the practical implementation of Memory Controller (MC) side defenses using the real-world JEDEC DDR5 standard. The authors correctly identify that much of the prior academic work in this area has assumed a hypothetical, single-bank Nearby-Row-Refresh (NRR) command, which was not adopted. Instead, JEDEC introduced Directed Refresh Management (DRFM), a multi-bank command that incurs prohibitively high performance overheads when used in a naive, NRR-like fashion.
The core contribution of this work is to reframe DRFM's multi-bank stall from a liability into an asset. The authors introduce the concept of "Rowhammer-mitigation Level Parallelism" (RLP) and propose a new design philosophy, DREAM, to exploit it. DREAM comes in two variants:
1. DREAM-R: For randomized trackers (like PARA and MINT), it decouples row sampling from mitigation by delaying the DRFM command. This delay allows multiple banks to identify and sample aggressor rows, thus increasing the RLP and amortizing the cost of a single DRFM command across several mitigations.
2. DREAM-C: For counter-based trackers, it leverages DRFM's ability to mitigate rows in all banks simultaneously by "ganging" multiple rows (one from each bank) to a single shared counter. This dramatically reduces the storage overhead of counter-based schemes by up to 32x or more.
The paper demonstrates that these approaches can reduce the performance overhead of randomized trackers to levels comparable to or even better than the theoretical NRR command, and can make counter-based tracking practical at very low thresholds with minimal storage.
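As a concrete, if highly simplified, picture of DREAM-R's temporal batching, the sketch below accumulates at most one sampled aggressor per bank and signals a DRFM once enough banks have a pending sample, with a small activation cap standing in for the Active Target-Row Monitoring discussed later in this review; the sampling probability, batching trigger, and cap are illustrative placeholders rather than the paper's derived parameters.

```python
import random

class DreamRSketch:
    """Simplified model of DREAM-R-style batching (illustrative only)."""

    def __init__(self, sample_prob=1/200, rlp_target=4, atm_cap=64):
        self.p = sample_prob          # randomized tracker's sampling probability (placeholder)
        self.rlp_target = rlp_target  # banks with a pending sample before DRFM is worthwhile
        self.atm_cap = atm_cap        # max further activations tolerated on a sampled row
        self.pending = {}             # bank -> (sampled_row, activations_since_sampling)

    def on_activation(self, bank, row):
        """Returns True when a DRFM command should be issued now."""
        if bank in self.pending:
            sampled_row, hits = self.pending[bank]
            if row == sampled_row:
                hits += 1
                self.pending[bank] = (sampled_row, hits)
                if hits >= self.atm_cap:      # ATM-style escape hatch: stop delaying
                    return True
        elif random.random() < self.p:        # tracker samples this row as a potential aggressor
            self.pending[bank] = (row, 0)
        # Delay the multi-bank DRFM until enough banks can share its stall cost.
        return len(self.pending) >= self.rlp_target

    def issue_drfm(self):
        """One DRFM mitigates every pending row; clear the batch."""
        mitigated = {bank: row for bank, (row, _) in self.pending.items()}
        self.pending.clear()
        return mitigated
```

The batching benefit is visible in the last line of on_activation: the expensive multi-bank stall is only taken once several mitigations can ride on it, which is the RLP idea described above.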
Strengths
-
Exceptional Timeliness and Practical Relevance: This paper's greatest strength is its grounding in reality. It bridges a significant gap between the theoretical assumptions of past academic proposals and the concrete engineering challenges posed by the available JEDEC DDR5 standard. By being one of the first works to seriously analyze and architect solutions around DRFM, it provides immense value to SoC designers and memory architects who must build secure systems today. This work effectively "course-corrects" the field of MC-side Rowhammer mitigation.
-
Elegant and Powerful Core Insight: The central idea—to exploit, rather than suffer from, DRFM's multi-bank nature—is both simple and profound. The concept of RLP is a clear and useful abstraction. This reframing turns what the paper shows to be a disastrous performance problem (e.g., 49% slowdown for PARA in Figure 5, p. 5) into a manageable one, demonstrating a high degree of intellectual contribution.
-
Comprehensive and Well-Structured Solution Space: The authors do not propose a single, narrow solution. Instead, they thoughtfully tailor their core insight to the two dominant paradigms of MC-side tracking: randomized and counter-based. DREAM-R's temporal batching (delaying) and DREAM-C's spatial batching (ganging) are elegant counterparts. This shows a deep understanding of the broader design space and makes the work widely applicable.
-
Strong Connection to the Broader Landscape: The paper does an excellent job of positioning its contributions. It not only re-evaluates foundational trackers like PARA and MINT but also provides a compelling comparison to recent in-DRAM industry proposals like PRAC (Figure 19, p. 12). By showing that DREAM can offer better performance than PRAC at much lower complexity, it makes a strong case for the continued relevance and potential superiority of MC-side defenses.
Weaknesses
While the core ideas are strong, there are areas where the real-world complexities slightly cloud the elegance of the initial concept. These are not fatal flaws but rather avenues for further refinement.
-
Introduced Complexity in DREAM-R's Security Model: The act of delaying DRFM in DREAM-R necessitates a re-evaluation of the underlying tracker's security guarantees. The paper commendably addresses this by proposing revised tracker parameters and an
Active Target-Row Monitoring (ATM) mechanism (Section 4.4, p. 7). However, this adds state (a copy of the sampled row, a counter) and complexity that slightly detracts from the simplicity of the core idea. The solution is sound, but it underscores that the "delay" is not "free" from a security-logic perspective.
-
Handling of DRFM Rate Limits: The discussion around JEDEC-mandated rate limits on DRFM (Section 6, p. 11) feels somewhat like a necessary but cumbersome addition. The proposed
Recent-Mitigated-Address-Queue (RMAQ) is a pragmatic solution, but it introduces yet another hardware structure and associated logic to manage. While the authors correctly identify that future mitigation techniques could obviate this, it adds a layer of complexity to the current proposal.
-
Potential Second-Order Effects: The paper focuses primarily on the direct performance impact (slowdown) and storage costs. However, the proposed mechanisms, particularly delaying DRFM commands in DREAM-R, could have interesting second-order effects on the memory scheduler, especially in systems with strict Quality-of-Service (QoS) requirements. An impending, high-priority DRFM command that stalls 8 banks could complicate scheduling decisions. This broader system interaction is an area left for future exploration.
Questions to Address In Rebuttal
-
Regarding the
Active Target-Row Monitoring (ATM) in DREAM-R: Could the authors elaborate on the security implications if a designer chose to omit ATM for cost reasons and instead relied solely on adjusting the tracker parameters (e.g., using p=1/85 for PARA)? How much does this weaken the security guarantee, and under what threat model would this become a significant vulnerability?
-
The Denial-of-Service (DoS) analysis for DREAM-C (Section 5.5, p. 10) argues that the worst-case throughput reduction is comparable to other memory contention issues like row-buffer conflicts. However, a DRFM-based DoS stalls an entire sub-channel. Could an attacker craft a workload that targets multiple distinct counter "gangs" in rapid succession to create a more sustained and severe system-wide availability issue than traditional bank-level or rank-level contention?
-
The paper argues compellingly against the high intrinsic overhead of PRAC. However, PRAC offloads the tracking logic entirely to the DRAM. DREAM keeps this logic on the MC. Could the authors comment on the trade-offs in terms of verification complexity, power, and area for the MC-side logic proposed in DREAM-R (with ATM) and DREAM-C (with randomized grouping and vertical sharing)? How does this compare to the cost of simply tolerating PRAC's performance hit?
Review 3
Paper Title: DREAM: Enabling Low-Overhead Rowhammer Mitigation via Directed Refresh Management
Reviewer ID: Persona 3 (The Innovator)
Summary
This paper presents DREAM, a memory controller (MC) side Rowhammer mitigation framework designed to reduce the high overheads associated with the JEDEC Directed Refresh Management (DRFM) command. The authors identify that naively substituting the hypothetical NRR command with the standardized DRFM command leads to prohibitive performance degradation due to multi-bank stalls. The work's core thesis is that the multi-bank nature of DRFM, while being a source of overhead, can be exploited for parallelism.
The authors propose two distinct contributions based on this insight:
1. DREAM-R: A technique for randomized trackers that decouples row sampling from mitigation. By delaying the issuance of a DRFM command, it allows multiple banks to sample aggressor rows, thereby increasing the number of rows mitigated per command (increasing "Rowhammer-mitigation Level Parallelism" or RLP) and reducing the frequency of multi-bank stalls.
2. DREAM-C: A technique for counter-based trackers that reduces storage overhead. It leverages the DRFMab command's ability to mitigate 32 rows concurrently by grouping rows from 32 different banks to share a single counter. A novel randomized grouping function is proposed to prevent "hot counters" that would otherwise arise from spatially correlated memory accesses.
The paper claims novelty in being the first to analyze the performance of DRFM for MC-side mitigations and in proposing a management layer (DREAM) that makes existing trackers practical with this new standard.
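A minimal sketch of the ganged-counter indexing idea is shown below; the table size, threshold, and the specific XOR-with-a-per-bank-mask hash are illustrative stand-ins for whatever DREAM-C actually specifies.

```python
import secrets

class DreamCSketch:
    """Illustrative gang-tracking: rows from many banks share one counter entry."""

    def __init__(self, num_banks=32, entries=4096, threshold=512):
        self.entries = entries
        self.threshold = threshold
        # Per-bank random masks, fixed at boot, break the spatial correlation
        # that would otherwise make streaming accesses pile onto one "gang".
        self.masks = [secrets.randbits(32) % entries for _ in range(num_banks)]
        self.counters = [0] * entries

    def gang_index(self, bank, row):
        # Randomized grouping: XOR the row address with the bank's boot-time mask.
        return (row ^ self.masks[bank]) % self.entries

    def on_activation(self, bank, row):
        """Returns True if an all-bank mitigation (DRFMab) should be issued."""
        idx = self.gang_index(bank, row)
        self.counters[idx] += 1
        if self.counters[idx] >= self.threshold:
            self.counters[idx] = 0   # one DRFMab mitigates every row in the gang
            return True
        return False
```

The storage saving follows directly from the indexing: one counter entry stands in for a whole gang of rows because DRFMab can mitigate all of them with a single command.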
Strengths (Novelty-focused Analysis)
The primary strength of this work lies in its novel approach to managing a new, real-world hardware standard (DRFM), rather than proposing yet another hypothetical tracker or mitigation command. The novelty is subtle but significant.
-
Novel Exploitation of a Standard's Properties: The central novel idea is the recognition that DRFM's multi-bank mitigation capability is a double-edged sword. Prior art has largely focused on trackers (PARA, MINT, Graphene) and assumed a simple, single-bank mitigation primitive like NRR. This paper's contribution is to architect a solution specifically for the command that actually exists in the JEDEC DDR5 standard. The concept of exploiting Rowhammer-mitigation Level Parallelism (RLP), as termed by the authors, appears to be entirely new in this context.
-
DREAM-R's Decoupled Management: The idea of delaying a mitigation command to batch operations is the key novelty in DREAM-R. In the context of prior NRR-based work, this would be nonsensical, as there is no parallelism to gain. The novelty is therefore fundamentally tied to the properties of DRFM. I have not seen prior work that proposes decoupling the identification of an aggressor from its mitigation in this manner to improve performance.
-
DREAM-C's Gang-Tracking: The concept of sharing one counter among a "gang" of rows is novel in its application to Rowhammer. While counter-sharing is a general computer architecture technique, its viability here is uniquely enabled by the DRFMab command's ability to mitigate all 32 rows in a single operation. Without this specific hardware feature, the idea would be impractical. The proposal to track a group of rows because they can be mitigated as a group is a genuinely new design point.
-
Novelty in Problem Identification: A significant part of the contribution is identifying and quantifying the performance problem of using DRFM naively (Section 2.7, page 5). While this may seem like evaluation, it is the first work to do so, and this identification is the direct motivation for the novel solutions proposed.
Weaknesses (Novelty-focused Analysis)
While the core framework is novel, some of the constituent mechanisms are applications of known principles, which slightly tempers the overall inventive step.
-
Randomized Grouping is an Application of a Known Concept: The insight that set-associative grouping leads to hot counters due to address mapping is excellent (Section 5.2, page 9). However, the solution—using a randomized mapping (XOR with a per-bank mask)—is a classic technique to mitigate aliasing and hash collisions in caches, predictors, and other hardware tables. The application to ganged Rowhammer counters is new, but the underlying principle of randomization to break spatial correlation is not a fundamental discovery. The delta here is in the application, not the invention of the technique itself.
-
Active Target-Row Monitoring (ATM) is a Reactive, Not Proactive, Invention: The ATM component proposed for DREAM-R (Section 4.4, page 7) is an elegant piece of engineering. However, its existence is solely to patch a security vulnerability created by the primary novel idea (delaying DRFM). It is a necessary fix, not a standalone novel concept. Its complexity is a direct cost of the core idea, and while well-executed, it does not represent a separate inventive leap.
Questions to Address In Rebuttal
-
The novelty of DREAM-C's randomized grouping rests on its specific application. Could the authors contrast this technique more directly with conceptually similar anti-aliasing schemes used in cache indexing or network packet classification? Clarifying the "delta" from these well-established uses of randomization would help solidify the novelty of this specific part of the contribution.
-
The core insight of the paper is to build a management layer atop a multi-entity hardware command (DRFM). Is this a "one-trick pony" for Rowhammer, or do the authors see this as a new class of optimization? For example, are there other emerging JEDEC commands (e.g., for memory security, error correction) where a similar "management" approach of decoupling and batching could yield analogous benefits?
-
The security of DREAM-R relies on the ATM mechanism to cap the number of unmitigated activations during the DRFM delay. Was this the only design point considered? For instance, could the delay be dynamically sized based on memory traffic intensity, potentially reducing the need for active monitoring in low-intensity phases, or is the fixed-threshold ATM the most robust solution? Exploring this design space would add depth to the novelty claim.
Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression
Abstract
Large language models (LLMs) have demonstrated transformative capabilities across diverse artificial intelligence applications, yet their deployment is hindered by substantial memory and computational demands, especially in resource-constrained ...
Reviews
Review 1
Paper Title: Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-Aware Cache Compression
Reviewer ID: The Guardian
Summary
The authors propose Ecco, a hardware-accelerated, entropy-aware cache compression framework for Large Language Models (LLMs) integrated at the L2 cache level. The technique combines group-wise non-uniform quantization using shared k-means patterns with Huffman coding to achieve high compression ratios (4x for weights/KV cache, 2x for activations). The core architectural contribution is a novel parallel Huffman decoder designed to overcome the traditional latency and throughput limitations of sequential variable-length decoding. The paper claims significant speedups (up to 2.9x over AWQ) and memory capacity improvements (nearly 4x) while maintaining state-of-the-art model accuracy. However, the work rests on a foundation of an exceedingly complex compression scheme whose hardware feasibility is justified by questionable scaling assumptions and lacks critical ablation studies to validate key design choices.
Strengths
-
Strong Accuracy Preservation: The most compelling aspect of this work is its demonstrated ability to maintain model quality under aggressive, lossy compression. The perplexity results in Table 1 (page 10) and the zero-shot accuracy in Table 2 (page 10) are impressive, showing that the proposed complex quantization and entropy coding scheme is effective at preserving information compared to other state-of-the-art 4-bit methods.
-
Sound Architectural Placement: The high-level architectural concept of integrating a custom compressor/decompressor with the L2 cache (Figure 1, page 2) is a well-established and logical approach to addressing memory bandwidth bottlenecks. This is a more architecturally pure solution than embedding dequantization logic deep within computational kernels.
-
Inclusion of Sensitivity Analysis: The sensitivity analysis regarding decompressor throughput and latency (Figure 14, page 13) is a welcome addition. It demonstrates a clear understanding of the architectural constraints and correctly identifies that the decompressor's performance must be tightly coupled with the cache hierarchy to avoid becoming a new bottleneck.
Weaknesses
My primary concerns with this paper revolve around the practical feasibility of the proposed hardware and the rigor of its evaluation. The claims are strong, but the evidence contains several critical gaps and leaps of faith.
-
Highly Questionable Hardware Cost Analysis: The hardware implementation is enormously complex. The compression algorithm (Figure 4, page 6) involves multi-level normalization, k-means clustering, selection from 64 shared patterns and 4 Huffman codebooks, and variable-length encoding. The decompression pipeline (Figure 8, page 9) requires 64 parallel Huffman decoders with a multi-stage tree-based result aggregator. Yet, the area and power analysis in Section 5.2 (page 11) is superficial at best. The authors synthesize the design in a commercial 28nm process and then simply "scale the area and power metrics to 7nm." This is a fundamentally unsound methodology. Technology scaling is not linear, and factors like wire delay, leakage power, and design rule complexity do not scale predictably in this manner. Claiming this complex machinery occupies less than 1% of the chip area and consumes only 7.36 W based on such a crude estimation is not credible. This weakness fundamentally undermines the paper's claim of practicality.
-
Critical Unsupported Claim in KV Cache Compression: In Section 3 (page 7), the authors state that for online KV cache compression, they replace the computationally expensive Mean Squared Error (MSE) calculation for pattern selection with a simplified min/max value comparison. They then assert that this simplification "incurs only a minimal drop in perplexity." There is absolutely no data presented in the paper to support this claim. An ablation study comparing the perplexity/accuracy of the online min/max method versus the offline MSE-based method is essential for validation. Without this evidence, this critical design choice for the dynamic KV cache, a major component of the overall system, is entirely unsubstantiated (a sketch of the two candidate selection rules appears after this list).
-
Misleading Performance Baseline Comparison: The performance evaluation in Section 5.3 (page 11) frames the speedup against methods like AWQ and SmoothQuant. While these are relevant SOTA quantization frameworks, they are primarily software-algorithmic techniques that do not presuppose custom hardware acceleration. The authors are comparing their specialized hardware solution against software methods running on general-purpose hardware. This is not an apples-to-apples comparison. A fair comparison would require benchmarking against other hardware-accelerated compression schemes from the computer architecture literature. By failing to do so, the reported speedups appear inflated, as they conflate the benefits of their specific algorithm with the inherent benefits of any hardware acceleration.
-
Convenient Omission of Key Baseline: The authors explicitly exclude Quarot from the performance evaluation (Section 5.3, page 11) because it is slower than the FP16 baseline. This is a critical omission. Table 1 (page 10) shows that Quarot's accuracy is highly competitive with Ecco. The entire narrative of the paper is that complex schemes like Quarot pay a prohibitive runtime cost, which Ecco's hardware solves. The most powerful way to demonstrate this would be to include Quarot in the performance charts (Figure 11, page 12) to visually show its high accuracy but poor latency, thereby perfectly motivating the need for Ecco. By removing it, the authors avoid a direct comparison with their closest competitor on the accuracy-complexity frontier, weakening their argument.
-
Unjustified Critical Path Latency Figure: The entire performance argument hinges on the decompressor being fast enough. The paper claims a 28-cycle latency for the high-ratio decompressor (Section 5.2, page 11). This number is presented as fact without any supporting detail on its derivation. Was this the result of a synthesis run targeting the A100's clock frequency? Does this figure account for the full pipeline including input buffering and output mapping? Given the complexity of the parallel decoder and result aggregator, a 28-cycle latency is aggressive and requires rigorous justification, which is currently absent.
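To make the requested ablation concrete, the sketch below contrasts the two selection rules as this reviewer understands them; the exact form of the paper's online min/max heuristic is not specified in the text, so the range-matching rule here is an assumption for illustration.

```python
import numpy as np

def select_pattern_mse(group, patterns):
    """Offline-style rule: pick the pattern whose centroids minimize reconstruction MSE."""
    def mse(pattern):
        # Map each value in the group to its nearest centroid in the candidate pattern.
        idx = np.abs(group[:, None] - pattern[None, :]).argmin(axis=1)
        return float(np.mean((group - pattern[idx]) ** 2))
    return min(range(len(patterns)), key=lambda i: mse(patterns[i]))

def select_pattern_minmax(group, patterns):
    """Online-style rule (assumed form): pick the pattern whose representable range
    best matches the group's min/max, skipping the per-element MSE computation."""
    lo, hi = float(group.min()), float(group.max())
    def range_gap(pattern):
        return abs(float(pattern.min()) - lo) + abs(float(pattern.max()) - hi)
    return min(range(len(patterns)), key=lambda i: range_gap(patterns[i]))
```

Running both selectors over a held-out set of KV-cache groups and reporting the resulting perplexity gap is exactly the ablation this review is asking for.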
Questions to Address In Rebuttal
-
Please provide a detailed justification for the 28nm to 7nm scaling methodology used in your hardware cost analysis. Acknowledge the non-linear effects of process scaling and explain why you believe your estimation is nonetheless reliable. Better yet, provide an analysis based on a more appropriate technology library or a detailed breakdown of the logic elements that justifies the area claim.
-
You must provide the data from an ablation study that compares the model accuracy (e.g., perplexity on WikiText-2) of your KV cache compression using the proposed online min/max pattern selection versus the offline MSE-based selection. This is necessary to substantiate the claim that the performance drop is "minimal."
-
Please justify the fairness of comparing your custom hardware solution against software-only quantization methods. Discuss relevant prior work on hardware-accelerated cache compression (even if not LLM-specific) and position your work in relation to those architectural baselines.
-
Please provide a detailed breakdown of the 28-cycle decompressor latency. How was this value determined? What were the synthesis constraints (e.g., target clock frequency)? What specific stages are included within this latency figure?
-
Explain the reasoning for excluding Quarot from the performance speedup comparisons in Section 5.3, given that it serves as a key accuracy baseline in Table 1 and represents the exact class of high-complexity, high-accuracy algorithms your hardware aims to accelerate.
Review 2
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper proposes Ecco, a novel, domain-specific, lossy compression scheme designed to be integrated into the GPU memory hierarchy to alleviate the memory bandwidth and capacity bottlenecks in Large Language Model (LLM) inference. The core contribution is the synthesis of several compression techniques—group-wise non-uniform quantization via shared k-means patterns and entropy coding via Huffman codes—into a hardware-realizable system. Crucially, the authors address the primary obstacle to using variable-length codes like Huffman in a high-performance memory path by designing a novel parallel Huffman decoder with a multi-stage pipeline. This architectural innovation aims to achieve decompression throughput comparable to a GPU L2 cache, making the entire scheme practical. The proposed system is evaluated through extensive simulation, demonstrating significant speedups (up to 2.9x over AWQ) and memory capacity improvements (nearly 4x) while maintaining state-of-the-art model accuracy.
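For readers less familiar with the pipeline being described, a toy end-to-end sketch of group-wise pattern quantization followed by Huffman coding of the resulting indices is given below; the group size, pattern count, and per-group codebook construction are placeholders and do not reflect Ecco's actual parameters, its shared codebooks, or its parallel decoder.

```python
import heapq
from collections import Counter
import numpy as np

def quantize_group(group, patterns):
    """Toy group-wise non-uniform quantization against shared centroid patterns."""
    scale = np.max(np.abs(group)) + 1e-8
    best, best_err = None, None
    for pid, centroids in enumerate(patterns):
        idx = np.abs(group[:, None] / scale - centroids[None, :]).argmin(axis=1)
        err = float(np.sum((group / scale - centroids[idx]) ** 2))
        if best is None or err < best_err:
            best, best_err = (pid, idx), err
    return best[0], scale, best[1]   # chosen pattern id, group scale, per-value centroid indices

def huffman_codebook(symbols):
    """Build a Huffman code over the observed centroid-index frequencies."""
    heap = [[freq, [sym, ""]] for sym, freq in Counter(symbols).items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {heap[0][1][0]: "0"}
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        for pair in lo[1:]:
            pair[1] = "0" + pair[1]
        for pair in hi[1:]:
            pair[1] = "1" + pair[1]
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    return {sym: code for sym, code in heap[0][1:]}

# Toy usage: 4 shared patterns of 16 centroids each, one 32-value group.
rng = np.random.default_rng(0)
patterns = [np.sort(rng.normal(size=16)) for _ in range(4)]
group = rng.normal(size=32).astype(np.float32)
pid, scale, idx = quantize_group(group, patterns)
codes = huffman_codebook(idx.tolist())
bitstream = "".join(codes[i] for i in idx.tolist())
```

Decoding a variable-length bitstream like this one is inherently sequential, which is precisely the bottleneck the parallel Huffman decoder highlighted in this review is meant to remove.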
Strengths
This work's primary strength lies in its excellent synthesis of ideas from computer architecture, machine learning, and information theory to create a compelling, systems-level solution to a critical problem.
-
Elegant Problem-Solution Fit: The authors correctly identify that the value distributions of LLM weights and activations exhibit low entropy, making them a natural target for entropy coding. While software-based quantization methods have exploited this, they often introduce computational overhead (as noted in Section 2.3, page 4). By moving a sophisticated, entropy-aware compression engine into the hardware cache controller, Ecco makes the process transparent to the software stack and avoids polluting the compute kernels with decompression logic. This is a powerful architectural approach that directly maps the nature of the data to the hardware that handles it.
-
Addressing the Core Implementation Challenge: The proposal would be merely a theoretical exercise without a feasible high-throughput decoder for the variable-length Huffman codes. The design of the parallel, pipelined decoder (Section 4.2, page 8 and Figure 8) is the lynchpin of this paper. Recognizing that sequential decoding is a bottleneck and architecting a parallelized solution demonstrates a deep understanding of the practical constraints of memory subsystem design. This transforms a good idea into a plausible engineering proposal.
-
Holistic System Design: The work is not just an algorithm but a well-considered system. The authors have thought through the entire data path, from the fixed-size compressed blocks that align with memory transactions to the trade-offs in the design space exploration (Figure 5, page 7) for k-means patterns and Huffman codebooks. The inclusion of area and power analysis (Section 5.2, page 11), while based on synthesis and scaling, adds a crucial layer of credibility to the proposed hardware's feasibility.
-
Strong Contextualization: The paper does a good job of situating its contribution within the broader landscape. It clearly distinguishes its architectural approach from purely algorithmic methods like AWQ and Quarot and from general-purpose lossless hardware compression found in current GPUs. It correctly frames the problem as a new "memory wall" specific to the LLM era.
Weaknesses
The paper's weaknesses are not fundamental flaws but rather areas where the analysis could be deepened to make the proposal even more robust.
-
Sensitivity to Calibration Data: The effectiveness of the entire scheme, particularly the pre-defined k-means patterns and Huffman codebooks, hinges on the calibration dataset being representative of real-world inference workloads. While The Pile is a diverse dataset, LLMs are increasingly used in specialized domains (e.g., code generation, medical analysis) where data distributions might shift significantly. The paper does not explore the sensitivity of Ecco's performance (both in terms of model accuracy and compression efficiency) to such distribution shifts. A more robust system might require some form of lightweight online adaptation, which is not discussed.
-
Lack of a Simpler Hardware Baseline: The performance comparison is primarily against software frameworks (AWQ, SmoothQuant) and an uncompressed FP16 baseline. While insightful, this doesn't fully isolate the benefits of Ecco's complexity. A valuable addition would be a comparison against a simpler lossy hardware compression scheme—for instance, a hardware implementation of simple group-wise uniform quantization without the Huffman coding. This would help quantify the specific gains achieved by the more complex entropy coding stage and better justify its area and power cost.
-
Limited Discussion of Synergies: The paper positions Ecco as an alternative to existing software quantization methods. However, it seems plausible that these approaches could be synergistic. For example, could a model already quantized with a method like AWQ be further compressed by Ecco for additional benefits? Furthermore, Ecco operates at the level of data representation, while techniques like PagedAttention operate at the level of memory management. The potential interplay between these orthogonal hardware and software optimizations is a rich area for discussion that is currently absent.
Questions to Address In Rebuttal
-
On Robustness: Could the authors comment on the robustness of the pre-calculated codebooks and k-means patterns to out-of-distribution data encountered during inference? Have you performed any experiments to measure the degradation in accuracy or compression ratio when the inference data statistics differ significantly from the calibration set (The Pile)?
-
On Justifying Complexity: The proposed compressor/decompressor is significantly more complex than the lossless compressors in today's GPUs. To better justify this, could you provide insight into what performance would be achieved by a simpler hardware implementation of just the group-wise non-uniform quantization part of your pipeline, without the subsequent Huffman coding? This would help isolate the contribution of the entropy coding stage.
-
On System-Level Integration: How do you envision Ecco interacting with modern LLM memory management systems like PagedAttention? Since PagedAttention already optimizes memory usage by managing the KV cache in non-contiguous pages, could Ecco be applied on a per-page basis to compound the benefits, effectively storing 4x as many token states within the same physical memory footprint?
Review 3
Review Form: The Innovator (Novelty Specialist)
Summary
The authors propose "Ecco," a hardware-accelerated cache compression system designed to alleviate memory bandwidth and capacity bottlenecks in Large Language Model (LLM) inference. The core of Ecco is a compression scheme that combines several known techniques: group-wise, non-uniform quantization using shared k-means codebooks, followed by Huffman coding for entropy encoding. The central claim to novelty lies not in the compression algorithm's individual components, but in their synthesis into a hardware system and, most critically, in the design of a novel parallel Huffman decoder architecture. This decoder is designed to overcome the inherent sequential limitations of Huffman codes, enabling throughput high enough for on-the-fly decompression between an L2 cache and streaming multiprocessors. The paper presents this architectural solution, evaluates its area and power, and demonstrates significant speedups and memory savings for LLM inference.
Strengths
The primary strength of this work lies in its architectural solution to a well-known problem. While the components of the compression scheme are familiar, the key innovation is making a theoretically powerful but practically slow compression technique (Huffman coding) viable at cache-level speeds.
-
Novel Parallel Huffman Decoder Architecture: The most significant novel contribution is the design of the parallel Huffman decoder detailed in Section 4.2 (page 8) and Figure 8 (page 9). The problem of parallelizing variable-length codes is decades old, but the proposed architecture—which uses 64 parallel decoders operating on overlapping 15-bit data chunks and a six-stage tree-based result aggregator—is a specific and clever engineering solution. This design directly addresses the critical path latency and throughput requirements of a GPU memory subsystem, which is a non-trivial architectural challenge. This is the lynchpin that makes the entire proposal feasible.
-
Novel Synthesis for a New Domain: While the constituent compression techniques are not new, their combination and application as an on-the-fly hardware cache compression scheme for LLMs is novel. Most prior work using k-means and Huffman coding (e.g., Deep Compression) has focused on creating statically compressed models for offline storage. Ecco’s contribution is to architect a system that performs this complex, lossy, entropy-aware compression/decompression dynamically as part of the memory hierarchy, which is a fundamentally different operational paradigm.
Weaknesses
My main concerns revolve around the paper's framing, which could be interpreted as claiming novelty for the compression primitives themselves, and a need for a more explicit comparison against prior art in parallel decoder design.
-
Constituent Algorithmic Components are Not Novel: The paper's core compression methodology is built upon a foundation of well-established prior art.
- Huffman Coding for NN Compression: The idea of using Huffman coding to compress quantized neural network weights is not new. It was a cornerstone of the highly influential "Deep Compression" paper (Han et al., 2016, cited by the authors as [25]), which demonstrated its effectiveness years ago.
- Non-Uniform Quantization via K-Means: Using k-means clustering to generate codebooks for non-uniform quantization is also a standard technique. It was used in the aforementioned Deep Compression work and more recent LLM-specific work like SqueezeLLM (cited as [37]).
- Group-wise Quantization: This is the standard approach in modern quantization methods like AWQ (cited as [42]) to balance compression ratio and accuracy.
The paper's novelty is therefore not in what it does algorithmically, but how and where it does it (i.e., in a hardware cache controller). The current framing could be sharpened to de-emphasize the algorithmic components as novel in themselves and focus more squarely on the architectural innovation.
-
Insufficient Differentiation from Prior Parallel Decoder Architectures: While the proposed decoder architecture appears novel in its specific implementation, the paper would be stronger if it explicitly situated its design within the broader literature of parallel Huffman/VLC decoding. Prior work exists on speculative decoding, lookup-table-based methods, and other chunking strategies. A brief discussion of why the proposed overlapping-window and tree-aggregation approach was chosen over these alternatives would better substantiate the novelty and design choices of that core component.
Questions to Address In Rebuttal
To strengthen the paper and clarify its precise contribution, I would ask the authors to address the following:
-
Please clarify the primary novel contribution of this work. Can you explicitly differentiate your work from "Deep Compression" and "EIE: Efficient Inference Engine" (Han et al., 2016, cited as [24])? Specifically, while those works also use Huffman coding and hardware acceleration, could you articulate how Ecco's focus on a general-purpose cache controller (as opposed to a full inference engine) and its specific parallel decoding architecture represent a significant delta?
-
The complexity of the proposed compressor/decompressor is substantial, involving bitonic sorters, multiple k-means pattern matchers, and the parallel Huffman decoder. Have you considered simpler lossy compression schemes that could be implemented in hardware? What is the justification for choosing this highly complex scheme over, for example, a simpler block-based transform coding or a more aggressive delta-based compression scheme that might offer a better trade-off between hardware cost and compression ratio?
-
Could you provide more context on the design of your parallel Huffman decoder (Section 4.2)? What are the primary trade-offs of your design (e.g., overlapping windows, fixed number of stages) compared to other approaches for parallelizing variable-length decoding found in prior art?
MeshSlice: Efficient 2D Tensor Parallelism for Distributed DNN Training
Abstract
In distributed training of large DNN models, the scalability of one-dimensional (1D) tensor parallelism (TP) is limited because of its high communication cost. 2D TP attains extra scalability and efficiency because it reduces communication relative to 1D ...
Reviews
Review 1
Paper Title: MeshSlice: Efficient 2D Tensor Parallelism for Distributed DNN Training
Reviewer: The Guardian
Summary
The authors propose MeshSlice, a novel 2D GeMM algorithm for distributed DNN training. The central idea is to partition (or "slice") collective AllGather (AG) and ReduceScatter (RdS) operations into multiple smaller, partial collectives. This partitioning, combined with software pipelining, is claimed to enable the overlap of communication and computation in both row and column dimensions of a 2D accelerator mesh. The paper also introduces an autotuner that uses analytical cost models to select optimal dataflows, mesh shapes, and communication granularities.
However, the paper's central claims of superiority rest almost entirely on simulated results derived from a model calibrated on a very small-scale system. The included real-hardware experiments not only fail to support the primary claims but actually show MeshSlice underperforming against established baselines due to current software limitations. The validity of the large-scale simulations and the practical feasibility of the proposed approach are therefore highly questionable.
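For reference, the arithmetic behind slicing a collective into partial collectives is easy to verify in isolation: the sketch below shows that S partial all-gathers over sub-shards, each feeding a partial GeMM, reproduce the result of one blocking AllGather followed by a full GeMM. The shapes, the slice layout, and the use of plain concatenation in place of a real collective are my own illustrative assumptions, not the paper's implementation.

```python
# Schematic of "partial collectives": replace one blocking AllGather + full GeMM
# with S smaller all-gathers over sub-shards, each immediately enabling a
# partial GeMM whose results accumulate to the same product.
import numpy as np

P, S = 4, 2                            # devices in one mesh row, slice count
M, K, N = 8, 16, 8
A = np.arange(M * K, dtype=np.float64).reshape(M, K)
B = np.arange(K * N, dtype=np.float64).reshape(K, N)
A_shards = np.split(A, P, axis=1)      # each device holds an M x (K/P) shard of A
B_shards = np.split(B, P, axis=0)      # and the matching (K/P) x N shard of B

C_ref = A @ B                          # blocking AllGather followed by the full GeMM

C = np.zeros((M, N))
for s in range(S):                     # one partial collective per iteration
    # "Partial AllGather": every device contributes only slice s of its shard.
    A_part = np.concatenate([np.split(a, S, axis=1)[s] for a in A_shards], axis=1)
    B_part = np.concatenate([np.split(b, S, axis=0)[s] for b in B_shards], axis=0)
    # The partial GeMM on slice s can overlap communication of slice s+1.
    C += A_part @ B_part
assert np.allclose(C, C_ref)
print("sliced partial collectives reproduce the full GeMM")
```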
Strengths
- Problem Motivation: The paper correctly identifies a critical bottleneck in scaling tensor parallelism: the inability of standard "Collective 2D GeMM" algorithms to overlap communication with computation. The motivation to solve this is well-founded.
- Conceptual Approach: The core idea of slicing a collective operation into multiple partial collectives to facilitate pipelining is conceptually sound and presents an alternative to decomposing them into point-to-point SendRecv operations (as in Wang et al. [34]) or fine-grained broadcasts (as in SUMMA [30]).
- Autotuner Framework: The recognition that 2D TP performance is sensitive to a complex interplay of hyperparameters (dataflow, mesh shape, sharding, granularity) is astute. Proposing an autotuner to navigate this space is a valuable direction, even if the implementation here has weaknesses.
Weaknesses
-
Over-reliance on Simulation and Lack of Empirical Validation: The primary weakness of this work is that its performance claims are not substantiated by real-world evidence.
- The impressive speedup figures (e.g., 12.0% and 23.4% faster than the state-of-the-art in Section 1, Abstract) are derived exclusively from the SST simulator (Section 5.1).
- The simulator itself is calibrated on a small 4-chip TPUv4 ring (Section 4.1, page 9). It is a significant and unsubstantiated leap to assume that a model calibrated on a 4-chip 1D topology can accurately predict the complex network dynamics, contention, and synchronization overheads of a 256-chip 2D torus. Non-linear scaling effects are not adequately addressed.
- Critically, the real hardware experiments on a 4x4 TPUv4 cluster (Section 5.3, Table 3) show MeshSlice performing worse than the Collective baseline (45.5% vs 47.4% FLOP utilization for GPT-3). The authors attribute this to a lack of software support for overlapping AG/RdS, but this does not excuse the lack of empirical proof. The "MeshSlice Overlap (Estim.)" column in Table 3 is purely speculative and cannot be considered a result.
-
Flawed Validation of the Autotuner's Cost Model: The autotuner's utility depends entirely on the accuracy of its analytical cost models. The validation provided is circular and insufficient.
- The cost model for communication is a simple linear function (Section 3.2.2, page 8) that ignores network contention, a crucial factor in large, busy torus networks.
- The validation in Figures 13 and 14 compares the analytical model's predictions against the simulator's results, not against real hardware. This merely demonstrates that the simple analytical model can approximate the behavior of the complex simulator. It provides zero evidence that either the model or the simulator accurately reflects reality at scale.
-
Unclear Practicality and Ignored Implementation Hurdles: The authors admit that their method is not currently feasible on the target hardware (TPUv4) due to the lack of support for asynchronous, overlapped AG/RdS operations (Section 5.3).
- This is not a minor implementation detail; it is a fundamental barrier to adoption. The paper presents a technique that relies on a software/hardware capability that does not exist in the evaluated public environment.
- There is no discussion of the software engineering, compiler modifications, or runtime support that would be required to enable this functionality. Without this, the work remains a purely theoretical exercise.
-
Insufficient Justification for Slicing Granularity (S): The paper claims that the slice count S provides a knob to trade off prologue/epilogue overhead against synchronization overhead (Section 3.1, page 5). However, the analysis is superficial. The optimal S is found via an exhaustive search over a small set of integers. A more rigorous analysis would characterize this trade-off more formally and explain why this new partitioning scheme is fundamentally better than existing ones (e.g., Wang et al.'s SendRecv decomposition), which also allows for granularity control via message size. A toy version of this trade-off is sketched after this list.
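To illustrate the last weakness, the toy model below uses invented constants: larger S shrinks the non-overlapped prologue and epilogue but pays a per-slice synchronization cost, and an exhaustive search over a small integer range picks the minimum, mirroring the procedure described in the paper. None of the numbers correspond to the model in Section 3.1.

```python
# Toy version of the S trade-off with invented constants: larger S shrinks the
# non-overlapped prologue/epilogue but pays a per-slice synchronization cost.
def total_time(comm, comp, S, sync):
    comm_s, comp_s = comm / S, comp / S
    pipeline = comm_s + (S - 1) * max(comm_s, comp_s) + comp_s
    return pipeline + S * sync            # each partial collective pays `sync`

comm, comp, sync = 10.0, 12.0, 0.3        # arbitrary units
for S in (1, 2, 4, 8, 16):
    print(S, round(total_time(comm, comp, S, sync), 2))
best = min(range(1, 17), key=lambda S: total_time(comm, comp, S, sync))
print("best S from exhaustive search:", best)
```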
Questions to Address In Rebuttal
- How can the authors justify the validity of their 256-chip simulation results when the simulator was calibrated on a 4-chip ring? Please provide specific evidence or analysis showing that network contention and synchronization overheads are modeled accurately at this scale.
- Given that MeshSlice underperforms on the real 4x4 TPUv4 cluster (Table 3), why should the community accept the speculative "Overlap (Estim.)" figures as evidence of the algorithm's effectiveness? This appears to be a claim based on a hypothetical machine, not the one that was tested.
- The autotuner's cost model is validated against the simulator, not hardware. Can you provide any data from real hardware (even at the 4x4 scale) that validates the accuracy of your analytical communication cost model?
- The entire premise of MeshSlice hinges on the availability of asynchronous AG/RdS collectives. Could you detail the specific compiler and runtime system changes required to enable this on a platform like JAX/XLA for TPUs? What is the estimated engineering effort and feasibility of such changes?
- In the real hardware experiments (Section 5.3.1), Wang et al.'s method also shows a smaller-than-expected speedup, which you attribute to JAX compiler optimizations creating dependencies. Could this same compiler behavior also be creating unforeseen dependencies for MeshSlice, invalidating the core assumption that your partial GeMMs can be effectively overlapped with your partial collectives in a real execution environment?
Review 2
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces MeshSlice, a novel algorithm for 2D tensor parallelism (TP) in distributed DNN training, designed to overcome the limitations of existing 2D GeMM methods. The core problem is that prior algorithms either incur high communication traffic (Cannon), suffer from synchronization overhead (SUMMA), or cannot effectively overlap communication with computation (Collective-based approaches). The state-of-the-art (e.g., Wang et al.) can only overlap communication in one dimension of the 2D accelerator mesh.
The essential contribution of MeshSlice is a new technique for partitioning collective communication operations (AllGather and ReduceScatter) into multiple partial collectives. This "slicing" enables software pipelining that can overlap communication with computation along both the row and column dimensions of the accelerator mesh simultaneously. This bidirectional overlap is the key mechanism for hiding communication latency more effectively. Complementing the algorithm, the authors present the MeshSlice LLM autotuner, which uses analytical cost models to automate the complex task of finding optimal configurations for dataflow, mesh shape, and communication granularity. The work is evaluated via simulation on TPUv4-like clusters and demonstrates significant performance improvements over existing methods for training large language models like GPT-3 and Megatron-NLG.
Strengths
-
Elegant and Foundational Core Idea: The central concept of partitioning a collective operation into smaller, partial collectives is a powerful and elegant solution to the communication overlap problem. It directly addresses the "all or nothing" nature of the standard Collective 2D GeMM algorithm and provides a principled way to overcome the one-dimensional limitation of prior overlap techniques. This feels less like a heuristic and more like a fundamental primitive that was missing from the distributed computing toolbox for this domain.
-
Excellent Contextualization and Problem Framing: The paper does a superb job of placing itself within the historical and current landscape of distributed training. The introduction and background sections (Sections 1 & 2, pages 1-4) clearly articulate the evolution from 1D TP to 2D TP and provide a concise yet thorough analysis of the trade-offs between Cannon, SUMMA, and Collective-based algorithms. The timeline comparison in Figure 4 (page 5) is particularly effective at visualizing the specific inefficiency that MeshSlice resolves. This demonstrates a deep understanding of the research area.
-
Addresses the Full System Problem: The authors recognize that a novel algorithm alone is insufficient without a means to configure it. The inclusion of the MeshSlice LLM autotuner is a major strength. Optimizing 2D TP involves a complex interplay between sharding, dataflow, mesh topology, and granularity (the slice count S). By providing an automated, model-driven approach to this co-optimization problem (Section 3.2, page 7), the authors elevate the work from a theoretical algorithm to a practical, usable system. This greatly increases its potential for real-world impact.
-
Significant Potential Impact on Hardware Co-Design: This work has clear implications for the future of ML accelerator design. By providing a software technique that can effectively utilize 2D torus interconnects, it strengthens the case for building such topologies over more complex and expensive fully-connected networks, especially at scale. The ability of MeshSlice to adapt to non-square mesh shapes is crucial, as it allows system designers to match the mesh topology to the aspect ratio of the underlying matrix operations, rather than the other way around.
Weaknesses
While this is a strong paper, there are areas where its context and scope could be further explored.
-
Dependency on Physical Mesh Topology: The performance benefits of MeshSlice hinge on the existence of a physical 2D torus interconnect where row and column communications are independent and contention-free. This is characteristic of Google's TPU clusters but not of most GPU clusters, which typically use fat-tree or Dragonfly topologies. The authors briefly acknowledge this in the Discussion (Section 6, page 13), but the challenge is significant. On a logical mesh mapped to a fat-tree, row and column collectives would contend for the same network links, potentially negating much of the benefit from bidirectional overlap. The paper would be stronger if it included a more detailed analysis or preliminary data on the sensitivity of MeshSlice to network contention.
-
Simplicity of the Autotuner's Dataflow Heuristic: The autotuner's method for choosing a dataflow (Section 3.2.1, page 7) is based on a heuristic: keep the largest matrix stationary and default to a non-transposed version to avoid reshuffling data between layers. While this is a sensible and likely effective rule of thumb for standard Transformer architectures, it feels less rigorous than the rest of the paper. It is an open question whether more exotic model architectures or memory access patterns might favor a dataflow that this heuristic would miss.
Questions to Address In Rebuttal
-
Applicability to Contention-Prone Networks: Could the authors elaborate on the viability of MeshSlice for GPU-based supercomputers? When implementing MeshSlice on a logical mesh over a fat-tree network, how would the analytical cost models in the autotuner need to be adapted to account for network contention between the row and column communications? Would the optimal strategy perhaps revert to overlapping in only one dimension, similar to Wang et al., to avoid this contention?
-
Robustness of the Dataflow Heuristic: Regarding the autotuner's dataflow selection, have the authors considered architectures where minimizing the size of the communicated tensors is more important than keeping the largest tensor stationary? For example, in a memory-capacity-constrained scenario, it might be preferable to communicate a smaller tensor even if it requires transposing a larger, stationary one. Can you comment on the robustness of your heuristic outside of the standard LLM training regime?
-
Broader Generalizability of "Partial Collectives": The core idea of slicing collectives into partial collectives to enable pipelining is very compelling. Do you see this as a general-purpose technique? For instance, could this approach be used to optimize other distributed algorithms that are bottlenecked by large, monolithic collective operations, even outside the context of GeMM or machine learning?
Review 3
Reviewer Persona: The Innovator (Novelty Specialist)
Summary
The paper proposes MeshSlice, a new algorithm for 2D General Matrix Multiplication (GeMM) aimed at optimizing 2D Tensor Parallelism (TP) in distributed DNN training. The central novel claim is a technique to partition collective communication operations (AllGather/ReduceScatter) into smaller partial collectives. This partitioning is applied to both the row and column dimensions of the accelerator mesh, enabling communication-computation overlap in both directions simultaneously. This is presented as a solution to the limitations of prior art: Cannon's algorithm is constrained to square meshes, SUMMA has high synchronization overhead, Collective 2D GeMM is blocking, and Wang et al. [34] only achieve overlap in a single dimension. The paper also introduces an autotuner that co-optimizes the dataflow, mesh shape, and a new "slice count" hyperparameter S introduced by the MeshSlice algorithm.
Strengths
-
Clearly Articulated and Novel Core Algorithm: The paper's primary strength lies in its core algorithmic contribution. The proposed method of slicing a collective operation into a series of smaller, partial collectives to enable pipelining is a distinct and novel approach in the context of 2D distributed GeMMs for DNN training. While prior work has explored partitioning, it has focused on different mechanisms with significant drawbacks.
-
Excellent Positioning Relative to Prior Art: The authors demonstrate a comprehensive understanding of the landscape of 2D GeMM algorithms. They correctly identify that Wang et al. [34] represents the closest state-of-the-art and precisely articulate the gap that MeshSlice aims to fill: achieving bidirectional overlap without incurring the high traffic and restrictive constraints of Cannon's algorithm [4]. The claim made in Section 2.3.4 (page 4) that partitioning collectives in both directions would otherwise require Cannon's algorithm effectively frames the problem that MeshSlice's novelty is designed to solve.
-
Introduction of a New, Tunable Parameter: The algorithm introduces the slice count S as a new hyperparameter that controls the granularity of the communication-computation overlap. This is a novel lever for performance tuning that does not exist in the monolithic Collective or Wang et al. algorithms. The autotuner's ability to optimize this parameter is a valuable supporting contribution.
-
Novelty in the Autotuner's Scope: While the use of analytical cost models is a standard technique, the autotuner's novelty lies in its application to the unique search space created by MeshSlice. Specifically, the ability to choose between different 2D dataflows (Section 3.2.1, page 7) and co-optimize this choice with mesh shape and the new slice count S is a more comprehensive approach than seen in related works like PrimePar [33], which is based on the more restrictive Cannon's algorithm.
Weaknesses
-
Incremental Nature of the Core Concept: While the application is novel, the fundamental concept of partitioning a large collective operation into smaller chunks to facilitate pipelining is a well-established pattern in high-performance computing (HPC). The paper would be strengthened by acknowledging that the core novelty is not the invention of "partial collectives" but rather their specific, structured application to solve the bidirectional overlap problem in 2D GeMMs for DNNs.
-
Novelty is Tightly Coupled to Torus Topology: The central claim of efficient, bidirectional overlap relies heavily on the non-contending nature of row and column communications in a physical 2D torus network (as found in TPU clusters). This assumption is critical to the algorithm's performance model. The novelty and associated benefits are less certain on more common interconnect topologies like fat-trees or Dragonfly networks, where logical row and column communications would contend for shared network resources. The discussion in Section 6 (page 13) acknowledges this but does not fully address how the fundamental value proposition of the algorithm might change in these prevalent hardware environments.
-
The "Slicing" Mechanism Itself: The mechanism for slicing the matrix shards (Section 3.1.2, page 6) is a necessary implementation detail, but it is not presented as a fundamental algorithmic novelty in its own right. It is an enabling technique for the core idea of partial collectives. The paper is clear about this, but it is worth noting that the novelty rests on the communication scheme, not the data layout transformation.
Questions to Address In Rebuttal
-
The core mechanism relies on partitioning AllGather/ReduceScatter into a loop of partial collectives on sub-shards. Could the authors clarify if this "partial AllGather/ReduceScatter" is a standard primitive available in the underlying communication libraries (e.g., as part of JAX/XLA) or a custom communication schedule implemented specifically for this work? This would help delineate the software engineering contribution from the conceptual novelty.
-
The key advantage over Wang et al. [34] is achieving overlap in a second dimension. This benefit is maximized on a physical torus where the two dimensions are independent. How does the novelty of MeshSlice hold up on a logical mesh mapped to a physical fat-tree network? Specifically, if row and column communications contend for the same physical links, does the benefit of overlapping the second dimension diminish significantly, potentially making it a marginal improvement over the simpler single-dimension overlap from prior art in that common scenario?
-
The autotuner optimizes dataflow, mesh shape, and slice count S. Table 2 (page 11) shows a significant 21.2% speedup for GPT-3 from dataflow optimization alone. Can the authors provide a sense of the relative importance of the novel contributions within the autotuner? Is the dataflow selection the dominant factor, or does the co-optimization of the new parameter S with the mesh shape provide a similarly crucial contribution to performance?
OptiPIM: Optimizing Processing-in-Memory Acceleration Using Integer Linear Programming
Abstract
Processing-in-memory (PIM) accelerators provide superior performance and energy efficiency to conventional architectures by minimizing off-chip data movement and exploiting extensive internal memory bandwidth for computation. However, efficient PIM ...
Reviews
Review 1
Reviewer Persona: The Guardian (Adversarial Skeptic)
Summary
The authors present OptiPIM, a framework that uses Integer Linear Programming (ILP) to generate what they claim are "optimal" mappings for data-intensive operators on Processing-in-Memory (PIM) accelerators. The central thesis is that existing mapping frameworks designed for ASIC accelerators are inadequate for PIM architectures due to PIM's stringent data layout requirements and the resulting complex cost trade-offs. The paper's primary contributions are a new "PIM-friendly" mapping representation based on a nested loop structure with an expanded set of indexing functions, and a more accurate cost model for estimating data movement that accounts for these layouts. The framework is evaluated on two PIM architectures (HBM-PIM and SIMDRAM) and shows significant speedups over heuristic and adapted ASIC-based mapping methods.
While the problem addressed is both timely and significant, I harbor serious reservations regarding the central claims of optimality, the methodological soundness of the baseline comparisons, and the generalizability of the reported results. The work rests on a foundation that appears strong at first glance but reveals critical weaknesses upon rigorous inspection.
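For orientation, the flavor of problem being solved looks roughly like the following one-hot selection ILP, written here with the open-source PuLP package purely as an illustration; the candidate set, cost function, and capacity constraint are invented and far simpler than OptiPIM's layout-aware formulation.

```python
# Generic illustration of ILP-based mapping selection (requires `pip install pulp`);
# not OptiPIM's actual formulation. Binary one-hot variables pick how 8 PUs split
# the M and N dimensions of a 64x64x64 GeMM, minimizing an invented replication
# cost under an invented per-PU capacity limit.
import itertools
import pulp

PUS, CAP = 8, 4096
M = K = N = 64
candidates = [(pm, pn) for pm, pn in itertools.product([1, 2, 4, 8], repeat=2)
              if pm * pn == PUS]

def replication_cost(pm, pn):
    # Toy layout-aware cost: operand elements replicated across the PU grid.
    return (M // pm) * K * pn + K * (N // pn) * pm

x = {c: pulp.LpVariable(f"map_{c[0]}x{c[1]}", cat="Binary") for c in candidates}
prob = pulp.LpProblem("pim_mapping", pulp.LpMinimize)
prob += pulp.lpSum(replication_cost(pm, pn) * x[(pm, pn)] for pm, pn in candidates)
prob += pulp.lpSum(x.values()) == 1          # exactly one mapping is chosen
for pm, pn in candidates:
    # Per-PU footprint of the chosen tile's operands must fit the invented budget.
    prob += ((M // pm) * K + K * (N // pn)) * x[(pm, pn)] <= CAP
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([c for c in candidates if x[c].value() > 0.5])
```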
Strengths
-
Problem Motivation: The paper does an excellent job in Section 3 (page 3) of motivating why PIM mapping is a distinct and challenging problem. The example in Figure 2 clearly illustrates how different computation schedules impose vastly different data layout, replication, and movement costs, correctly identifying the core deficiencies of layout-agnostic mapping approaches.
-
Improved Cost Estimation for Unique Indices: The authors correctly identify a key source of inaccuracy in prior work—the use of loop bound products to estimate data size. The proposed estimation function for unique indices in a linear combination (Equation 11, page 8) is more mathematically principled and demonstrably more accurate (Figure 7) than the naive upper-bound approach. This is a solid, albeit localized, contribution (a two-line toy count after this list shows how far off the product bound can be).
-
Validation against Exhaustive Search: The claim of optimality is, for the specific workloads tested, substantiated by a comparison against a full exhaustive search (Section 7.5.3, page 12). This provides confidence that, within the confines of their defined search space and cost model, the ILP formulation is indeed finding the best solution.
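Regarding the second strength above, a brute-force count illustrates how loose the loop-bound product can be for a linear index expression; Equation 11's closed form is not reproduced here.

```python
# Toy count for the access a[4*i + j], i in [0, 4), j in [0, 8): the loop-bound
# product claims 32 elements, but only 20 distinct addresses are ever touched.
unique = {4 * i + j for i in range(4) for j in range(8)}
print(len(unique), "unique addresses vs.", 4 * 8, "from the loop-bound product")
```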
Weaknesses
-
The Claim of "Optimality" is Circular and Unverified: The paper's most prominent claim is its ability to generate optimal mappings. However, this optimality is only with respect to its own analytical cost model. This model's "ground truth" for validation is a custom-modified version of the Ramulator 2 simulator (Section 7.1, page 10 and Artifact Appendix A.1, page 15). The authors provide no evidence that this simulator has been validated against real PIM hardware or a trusted hardware model. The high correlation shown in Figure 9 only proves that the analytical model is a faithful abstraction of their simulator; it does not prove that either accurately reflects reality. Without validation against a genuine hardware ground truth, the "optimality" is merely a theoretical construct within a self-contained system.
-
Fundamentally Flawed Baseline Comparison: The comparison against the ASIC-based mapping method (COSA [23]) is a classic strawman argument. The authors state they "adopt COSA’s ILP optimization... without considering the fine-grained data layout-related costs" (Section 6.4.3, page 10). This is an inherently unfair comparison. The entire premise of the paper is that these costs are critical. By disabling the very capability that would make the baseline competitive, the authors are not comparing OptiPIM to a state-of-the-art ASIC mapper adapted for PIM; they are comparing it to a deliberately crippled baseline. A scientifically valid comparison would have involved attempting to integrate the new PIM-specific cost functions into the COSA framework to see if it could then find a comparable mapping. The current results only prove that ignoring PIM costs leads to poor PIM performance, which is a trivial conclusion.
-
Scalability of the ILP Formulation is Not Proven: The authors report a maximum optimization time of 4 minutes (Abstract, page 1), suggesting the ILP approach is highly efficient. This claim is based on a limited set of operators from common ML models. ILP problem complexity can scale exponentially with the number of variables and constraints. The paper provides no analysis or empirical data on how the solver runtime scales with key problem parameters, such as the number of loop nests, the magnitude of loop bounds, or the number of tensors. A single, more complex operator from a different domain could potentially cause the solver time to explode, rendering the approach impractical. The claim of efficiency is therefore not robustly supported.
-
Novelty and Expressiveness of the Mapping Representation is Overstated: The proposed mapping representation is described as a key innovation. However, the hierarchical nested loop structure for PU, column, and computation scheduling (Section 4.2, page 5) is conceptually equivalent to standard loop tiling and vectorization, common in many compiler frameworks. The main novelty lies in the exploration of 6 coefficient combinations for the indexing function (Section 4.3, page 6). While this is an improvement over a fixed stride-1 indexing, the paper fails to discuss why this fixed-structure approach is superior to more general and powerful representations like the polyhedral model, which can represent a much broader class of transformations and data layouts. It is not clear that the proposed representation can capture all potentially useful mappings.
Questions to Address In Rebuttal
-
Regarding the ASIC Baseline: Please justify the decision to evaluate the COSA framework without incorporating the PIM-specific cost terms you identified. Could the authors instead augment the ASIC baseline's cost model with these terms and re-evaluate? If this is not possible, please provide a detailed explanation of the fundamental limitations in the baseline's representation that prevent such an integration.
-
Regarding the Ground Truth for Optimality: How was the modified Ramulator 2 simulator itself validated? Can you provide any data correlating your simulator's performance predictions with either real commercial PIM hardware (e.g., Samsung's HBM-PIM or AXDIMM) or a hardware vendor's internal, validated simulation models?
-
Regarding ILP Scalability: Beyond the reported runtimes, can the authors provide data on how the ILP solver time scales as operator complexity (e.g., number of loops, size of loop bounds, complexity of indexing functions) increases? Please characterize the point at which the "within 4 minutes" claim no longer holds.
-
Regarding the Mapping Representation: Please clarify why the proposed nested loop structure with 6 indexing combinations is fundamentally more expressive or efficient for PIM mapping than existing, more general compiler representations like polyhedral models, which are also amenable to ILP-based scheduling. What specific, valid mappings can your representation express that a polyhedral model cannot, or vice versa?
Review 2
Review Form: The Synthesizer
Summary
This paper presents OptiPIM, a framework for optimizing the mapping of data-intensive workloads onto digital Processing-in-Memory (PIM) architectures. The core contribution is the formulation of this complex mapping problem as an Integer Linear Programming (ILP) problem. The authors rightly argue that existing mapping techniques, designed for conventional ASIC accelerators, are insufficient for PIM due to its strict data layout requirements, which introduce significant costs related to data movement and replication that prior models neglect.
To enable this ILP formulation, the paper introduces two key technical innovations: 1) a novel, layout-aware nested loop representation that explicitly models the hierarchical structure of PIM systems (PU allocation, column allocation, and computation scheduling), and 2) more accurate cost models, particularly for estimating the number of unique memory accesses, which prove to be far more precise than the upper-bound estimates used in previous work. The framework is implemented in MLIR and evaluated on two representative digital PIM architectures (HBM-PIM and SIMDRAM). The results demonstrate that OptiPIM can find provably optimal mappings in minutes, delivering significant performance improvements (at least 1.9x, and much higher in many cases) over established heuristic and search-based methods.
Strengths
This is an excellent and highly significant piece of work. Its primary strength lies in bringing formal optimization to a problem space that has, until now, been dominated by heuristics and ad-hoc methods.
-
Clear and Compelling Problem Definition: The authors do a masterful job in Section 3 ("Motivation", page 3) of articulating why the PIM mapping problem is fundamentally different from the well-studied ASIC accelerator mapping problem. The example in Figure 2 is particularly illustrative, clearly showing how different mapping choices lead to vastly different data layouts, memory footprints, and data movement costs. This precise problem definition elevates the paper beyond a simple application of a known technique (ILP) to a new domain.
-
Foundational Contribution to PIM Compilation: By successfully formulating the PIM mapping problem for an ILP solver, OptiPIM provides an "oracle" for the community. It establishes a theoretical performance ceiling for a given operator on a given PIM architecture. This is invaluable; it allows researchers to quantify the performance gap of any new, faster heuristic they might develop. It moves the field from "this mapping seems fast" to "this mapping is X% of optimal."
-
Technically Sound and Novel Representation: The proposed layout-aware representation (Section 4, page 5) is well-suited to the problem. The explicit separation of PU allocation, column allocation, and scheduling maps cleanly to the physical realities of PIM hardware. Furthermore, the exploration of different data indexing functions (Section 4.3, page 6) is a subtle but powerful enhancement that expands the design space beyond what previous rigid mapping representations allowed. The accurate cost function for unique inputs (Section 5.3, page 8 and Figure 7) is a concrete improvement over prior art and is critical for the ILP solver to find a truly optimal solution.
-
Rigorous and Insightful Evaluation: The experimental methodology is comprehensive. The validation against a cycle-accurate simulator (Figure 9, page 10) builds confidence in the analytical model. The choice of baselines—including a state-of-the-art PIM heuristic, a search-based method, and an adapted ASIC mapping framework (COSA)—is strong and allows for a nuanced comparison. The performance breakdown plots (Figure 10, page 10) are particularly insightful, clearly showing that OptiPIM wins by minimizing the layout-dependent costs (like input loading) that other methods ignore or model poorly.
Weaknesses
The weaknesses of this paper are largely related to the inherent limitations of its chosen approach (ILP) and the scope of the problem addressed, rather than flaws in the execution of the research itself.
-
Scalability of the ILP Formulation: The classic Achilles' heel of ILP is its computational complexity. While the authors demonstrate impressive results with optimization times under 4 minutes per operator, this is for single operators. The complexity of the ILP problem can grow exponentially with the number of variables and constraints. It is unclear how this approach would scale to much larger, fused operators or if it were applied to an entire computational graph simultaneously, rather than on an operator-by-operator basis. The multi-operator optimization discussed in Section 5.6 is a good start but is not explored in depth.
-
Static Workload Assumption: The framework operates at compile-time and assumes that the dimensions of all tensors are known statically. This is a reasonable assumption for many vision models but is increasingly challenged by modern workloads, such as Transformers with variable sequence lengths or dynamic graph structures. The discussion in Section 7.7 acknowledges this, but it remains a significant practical barrier to applying OptiPIM directly in more dynamic execution environments.
Questions to Address In Rebuttal
The paper is excellent and I am strongly in favor of acceptance. The following questions are intended to help the authors strengthen the final version and contextualize the work's limitations.
-
On the Nature of Optimality: The paper focuses on optimizing individual operators. Could the authors comment on how the local optimality for each operator translates to global optimality for a full network? Is it possible that the optimal mapping for layer N results in a data layout that is highly inefficient for consumption by layer N+1, thereby requiring a costly data re-layout that negates the gains? Does the multi-operator formulation in Section 5.6 account for these inter-operator layout transformation costs?
-
On the Practicality for Compilers: Given the potential scalability concerns of ILP, do the authors envision OptiPIM being used directly in a production compiler, or do they see its primary role as an offline tool to generate a library of optimal mappings for common operator shapes? Alternatively, could the insights from the ILP solutions be used to train a machine learning model that could predict near-optimal mappings much more quickly?
-
On Architectural Generalization: The framework models two digital PIM architectures. How adaptable is the proposed representation and cost modeling to other classes of PIM, for instance, analog compute-in-memory (CIM) systems? What new constraints or cost functions (e.g., related to ADC/DAC precision, noise, or non-ideal device behavior) would need to be incorporated, and would the problem still be tractable within an ILP framework?
Review 3
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present OptiPIM, a framework for generating optimal software-hardware mappings for Processing-in-Memory (PIM) architectures using Integer Linear Programming (ILP). The central claim to novelty is not the use of ILP for mapping—a technique well-established in the ASIC accelerator domain—but rather the development of a novel, PIM-specific mapping representation and associated cost models that make the ILP formulation both feasible and accurate for the unique constraints of PIM. Specifically, the authors' claimed novel contributions are: (1) a layout-aware, multi-level nested loop representation that explicitly models PU allocation, column allocation, and computation scheduling; (2) a "Comprehensive Data Indexing" method that explores a bounded set of linear transformations for data indexing, moving beyond the rigid stride-1 indexing of prior work; and (3) a more accurate cost function for estimating data sizes, which is critical for modeling PIM's layout-dependent costs.
Strengths
The primary strength of this work, from a novelty perspective, lies in its precise and well-reasoned departure from prior art. The authors correctly identify that existing ILP-based mapping frameworks for ASICs (e.g., COSA [23]) are fundamentally unsuitable for PIM due to their assumptions about data layout and movement. The novel contributions directly address these identified gaps:
-
Comprehensive Data Indexing (Section 4.3, Page 6): This is the paper's most significant and genuinely novel contribution. Prior mapping frameworks typically assume a fixed, tiled indexing scheme (e.g., x = x_outer * tile_size + x_inner). The authors' formalization that multiple bijective linear transformations are possible and that for a 3-level tiling there are exactly 6 such combinations (Figure 6b) is a novel insight. Incorporating the choice between these functions as a variable within the ILP formulation is a powerful and original technique that significantly expands the explorable design space to find more optimal data layouts (the six maps for a toy 3-level tiling are enumerated in the sketch after this list).
Accurate Cost Modeling for Unique Indices (Section 5.3.3, Page 8): The second key novel element is the proposed cost estimation function for the number of unique inputs generated by a linear combination of loop variables (Equation 11). The paper provides a clear empirical demonstration in Figure 7 that the prior art's method (product of loop bounds) is highly inaccurate for this task, whereas their proposed function is very precise. This is not merely an incremental improvement; it is a necessary correction to make the ILP objective function meaningful for PIM, where data movement costs stemming from layout are dominant.
-
Structured PIM Mapping Representation (Section 4.2, Page 5): While hierarchical loop representations are not new in themselves, the specific three-level structure proposed (PU Allocation, Column Allocation, Computation Scheduling) provides a clean and effective abstraction for the PIM mapping problem. It correctly separates the different granularities of parallelism and data locality inherent to PIM architectures, making it amenable to a concise ILP formulation. This structured approach is a novel formalization tailored specifically for the PIM domain.
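As referenced in the first strength above, the claim of exactly 3! = 6 bijective linear index maps for a 3-level tiling can be illustrated with a small enumeration; the bounds and naming below are my own, not the paper's Figure 6b notation.

```python
# Enumerate the 3! = 6 bijective linear index maps for a 3-level tiling with
# illustrative bounds (B2, B1, B0) = (2, 3, 4): each permutation of the levels
# yields a set of strides that maps (x2, x1, x0) one-to-one onto [0, 24).
import itertools

bounds = {"x2": 2, "x1": 3, "x0": 4}

for order in itertools.permutations(bounds):            # last level varies fastest
    stride, coeff = 1, {}
    for level in reversed(order):
        coeff[level] = stride
        stride *= bounds[level]
    values = sorted(coeff["x2"] * a + coeff["x1"] * b + coeff["x0"] * c
                    for a in range(2) for b in range(3) for c in range(4))
    assert values == list(range(24))                     # bijective, as claimed
    print(order, coeff)
```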
Weaknesses
The paper's claims of novelty should be carefully contextualized. While the specific formulation is new, the foundational components are built upon extensive prior work.
-
Overstated Novelty of the ILP Approach: The use of ILP to solve the accelerator mapping problem is not novel. This has been the core technique in works like COSA [23] and LEMON [52]. The authors' novelty lies entirely in the adaptation and formulation of the problem for PIM, not in the choice of ILP as a solver. The abstract and introduction could be clearer on this distinction, emphasizing that the novelty is the representation and modeling that enables an effective ILP solution for PIM.
-
Insufficient Differentiation from PIM-Specific Mapping Frameworks: The paper primarily contrasts its work against ASIC mapping frameworks. However, other frameworks exist for PIM mapping, such as ARES [7], which also proposes a general representation. While the authors correctly state that ARES uses search-based algorithms, the paper would be stronger if it included a more direct technical comparison of its representation to that of ARES in the main body. A brief discussion on why the ARES representation is unsuitable for an exact ILP formulation, and how OptiPIM's novel representation overcomes this, would more firmly establish its unique contribution within the PIM-specific literature.
Questions to Address In Rebuttal
-
Scalability of Comprehensive Indexing: Section 4.3 and Figure 6 show that for 3 loop levels, there are 3! = 6 valid coefficient combinations to explore. This seems tractable. However, how does this novel aspect of the formulation scale as the number of tiled loop levels (l) increases? The number of combinations would grow as l!. At what point does this enumeration of indexing choices become a bottleneck that makes the ILP problem intractable to formulate or solve?
Novelty vs. ARES Representation: The related work section mentions ARES [7], which proposes a "general representation" for mapping operators onto diverse PIM architectures. Could the authors please elaborate on the fundamental limitations of the ARES representation that make it ill-suited for an ILP-based optimization approach? What specific, novel features of the OptiPIM representation enable the concise and efficient ILP formulation presented in this work, which would not be possible if starting from the ARES abstractions?
-
Justification for Linearity of Transformation: The novel indexing in Section 4.3 explores linear transformations of the form d*x2 + e*x1 + g*x0. Is there a theoretical or architectural justification for limiting the search to only linear functions? Have the authors considered whether non-linear indexing functions could provide further benefits, and if so, why were they excluded from the novel formulation?
HeterRAG: Heterogeneous Processing-in-Memory Acceleration for Retrieval-augmented Generation
Abstract
By integrating external knowledge bases, Retrieval-augmented Generation (RAG) enhances natural language generation for knowledge-intensive scenarios and specialized domains, producing content that is both more informative and personalized. RAG systems ...
Reviews
Review 1
Review Form: The Guardian
Summary
This paper, "HeterRAG," proposes a heterogeneous Processing-in-Memory (PIM) architecture to accelerate Retrieval-augmented Generation (RAG) workloads. The authors identify that the two primary stages of RAG—retrieval and generation—have distinct system requirements. Retrieval is characterized by random memory access over large datasets, demanding high capacity, while generation is memory-bandwidth intensive. To address this, they propose a system combining low-cost, high-capacity DIMM-based PIM ("AccelDIMM") for the retrieval stage and high-bandwidth HBM-based PIM ("AccelHBM") for the generation stage. The system is further enhanced by three software-hardware co-optimizations: locality-aware retrieval, locality-aware generation, and a fine-grained parallel pipeline. The evaluation, conducted through a simulation framework, claims significant throughput and latency improvements over CPU-GPU and other PIM-based baselines.
Strengths
-
Well-Motivated Problem: The paper correctly identifies a critical and timely problem. The characterization of RAG workloads in Section 3, using execution breakdowns and roofline models (Figures 2, 3, 4), provides a solid foundation for the proposed solution. The analysis clearly establishes that both stages are memory-bound but with different characteristics, justifying a heterogeneous approach.
-
Logical High-Level Design: The core architectural concept—mapping the capacity-demanding retrieval stage to DIMM-PIM and the bandwidth-demanding generation stage to HBM-PIM—is sound. The authors rightly point out the futility of a naive approach where data is shuttled from DIMMs to HBM over a slow interconnect (Section 3.2), thereby motivating the need for compute capabilities on the DIMM side.
-
Inclusion of Relevant Baselines: The study includes "NaiveHBM" and "OnlyDIMM" baselines, which are crucial for validating the central hypothesis. The poor performance of NaiveHBM effectively demonstrates the interconnect bottleneck, while the comparison against OnlyDIMM helps isolate the benefits of using HBM for the generation stage.
Weaknesses
My primary concerns with this manuscript center on the validity of key performance claims, the ambiguity of critical mechanisms, and the overall robustness of the evaluation methodology.
-
Unsubstantiated Performance Claims and Questionable Scaling: The claim of "near-superlinear throughput improvement" for the retrieval stage (Section 5.4, page 11) is a significant red flag. Superlinear speedup is exceptionally rare and requires a strong theoretical justification, such as caching effects that scale non-linearly with the number of nodes. The paper attributes this to "data parallelism," which at best explains linear scaling. Without a rigorous explanation, this claim undermines the credibility of the entire evaluation. The headline performance numbers ("up to 26.5x") are also potentially misleading, as is common with "up to" metrics, and may not reflect average-case behavior.
-
Ambiguity in Core Optimization Mechanisms: The "fine-grained parallel pipeline" is presented as a key contribution (Section 4.4, page 9), but its implementation details are critically underdeveloped. The paper states the host "aggregates retrieval results at fixed intervals" and sends "high-confidence results" ahead. This is vague. How is the interval determined? What is the sensitivity of the system to this hyperparameter? What is the precise, non-heuristic logic for identifying a result as "high-confidence"? The efficacy of this entire optimization hinges on these details, which are absent from the paper.
-
Weakness of Evaluation Methodology: The entire evaluation rests on a simulation framework combining Ramulator and ZSim (Section 5.1, page 10). While simulation is a standard practice, this work fails to account for several real-world complexities:
- Interconnect Modeling: The high-level interconnect is specified as CXL, but the performance impact of the CXL switch network, protocol overhead, and coherence traffic is not discussed or apparently modeled in detail. These factors can introduce non-trivial latency and limit scalability.
- Baseline Hardware: The CPU-GPU baseline uses an NVIDIA V100 GPU. While a strong GPU, it is now two generations old. A comparison against a more contemporary architecture (e.g., H100) with significantly higher memory bandwidth and advanced features would provide a much more realistic assessment of HeterRAG's claimed benefits. The chosen baseline may artificially inflate the reported speedups.
- Comparisons to Prior Work: The comparisons in Section 5.5 are made against results reported in other papers. This is not a scientifically rigorous method, as underlying experimental assumptions (e.g., system configuration, simulator parameters, benchmarks) are invariably different. These comparisons are suggestive at best and cannot be considered conclusive proof of superiority.
-
Incremental Novelty of Components: While the system-level integration is novel, the individual components appear to be implementations of existing ideas. The locality-aware generation is explicitly "inspired by a recent study [87]," the PIM architectures build upon concepts from AttAcc [64] and Newton [25], and vertex caching for retrieval is a standard technique. The paper needs to more clearly articulate the novel architectural contributions beyond the high-level system concept.
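The scaling question raised above is concrete enough to illustrate. The sketch below is a toy throughput model in which every parameter (dataset size, per-node cache size, hit and miss latencies) is an invented placeholder rather than a value from the paper; it only shows the kind of aggregate-caching mechanism that could legitimately produce better-than-linear scaling, and that plain data parallelism alone cannot.

```python
# Toy model: why aggregate caching can yield (near-)superlinear scaling.
# All parameters are illustrative assumptions, not numbers from the paper.

def throughput(num_nodes,
               dataset_gb=1024.0,   # total knowledge base size (assumed)
               cache_gb=16.0,       # per-node vertex cache (assumed)
               t_hit_us=1.0,        # access latency on a cache hit (assumed)
               t_miss_us=10.0):     # access latency on a cache miss (assumed)
    """Accesses served by num_nodes nodes over an evenly sharded dataset."""
    shard_gb = dataset_gb / num_nodes
    hit_rate = min(1.0, cache_gb / shard_gb)   # cache covers more of a smaller shard
    t_access = hit_rate * t_hit_us + (1.0 - hit_rate) * t_miss_us
    return num_nodes / t_access                # accesses per microsecond, all nodes

base = throughput(1)
for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} nodes: speedup {throughput(n) / base:5.2f}  (linear would be {n})")
```

If the authors' explanation is of this form, reporting per-node cache hit rates versus node count would settle the question.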
Questions to Address In Rebuttal
The authors must provide clear and concise answers to the following questions to justify the paper's claims:
-
Regarding "Near-Superlinear" Scaling: Please provide a detailed, evidence-backed explanation for the observed near-superlinear scaling of the AccelDIMM devices. What is the underlying architectural or algorithmic phenomenon that causes the system to scale better than linearly? Standard data parallelism does not suffice as an explanation.
-
Regarding the Fine-Grained Pipeline: Please elaborate on the scheduling algorithm for the fine-grained pipeline. Specifically:
- How is the aggregation interval determined, and how sensitive is overall performance to this value?
- What is the exact criterion used by the host to classify a partial retrieval result as "high-confidence" and thus suitable for early forwarding to the generation stage?
-
Regarding Evaluation Baselines:
- Can you justify the choice of the V100 GPU as the primary baseline, given the existence of newer architectures with substantially higher memory bandwidth and compute power?
- In the "OnlyDIMM" baseline, how does the design of the bank-level generation unit (BPM) compare in terms of computational throughput and efficiency to the dedicated AccelHBM device? Please clarify if this is a fair, apples-to-apples comparison of generation capability.
-
Regarding Interconnect Overheads: The CXL interconnect is not lossless. What latency and bandwidth assumptions were made for the interconnect in your simulation, and how does system performance degrade as interconnect latency increases or effective bandwidth decreases due to protocol overhead?
Review 2
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents HeterRAG, a heterogeneous Processing-in-Memory (PIM) architecture designed to accelerate Retrieval-Augmented Generation (RAG) workloads. The core contribution is the insightful observation that the two primary stages of RAG—retrieval and generation—have fundamentally different memory requirements. Retrieval is characterized by large data capacity needs and irregular, random memory access patterns, while generation is dominated by high-bandwidth, more regular access.
Instead of proposing a monolithic PIM solution, the authors advocate for a specialized, heterogeneous system. They map the capacity-intensive retrieval stage to low-cost, high-capacity DIMM-based PIM (AccelDIMM) and the bandwidth-intensive generation stage to high-performance HBM-based PIM (AccelHBM). This architectural separation is complemented by several software-hardware co-optimizations, including locality-aware caching and a fine-grained pipeline to overlap retrieval and generation. The paper demonstrates through simulation that this approach significantly outperforms conventional CPU-GPU systems and more naive PIM configurations in terms of throughput, latency, and energy efficiency.
Strengths
-
Excellent Problem-Architecture Mapping (The Core Insight): The paper's primary strength is its clear-eyed identification of RAG as a workload with two distinct phases whose memory characteristics map beautifully onto the two major PIM technologies available today. The retrieval stage, with its massive knowledge bases, is a natural fit for the capacity and cost profile of DIMM-based PIM. The generation stage, bottlenecked by GEMV operations during autoregressive decoding, is a perfect candidate for the high bandwidth of HBM-based PIM. This is not just an application of PIM; it is a thoughtful synthesis of the right tool for the right job, which represents a significant step forward in thinking about system design for complex, multi-stage AI workloads.
-
Strong Grounding in a Critical Workload: The paper addresses a problem of immense practical importance. RAG is rapidly becoming the de facto standard for building knowledgeable and factual AI systems. By focusing on the system-level bottlenecks of this entire workflow, rather than just LLM inference in isolation, the work is highly relevant and has the potential for significant impact. The characterization study in Section 3.1 (page 4), using roofline models and execution breakdowns, provides a compelling, data-driven motivation for the proposed architecture.
-
Holistic and Plausible System Design: The authors present more than just a pair of accelerators; they propose a complete system. The inclusion of a CXL-based interconnect, a clear host-device execution flow (Section 4.1, page 5), and a conceptual software stack (Figure 11, page 9) shows a mature approach to system design. This holistic view makes the proposal more credible and provides a clearer path toward a real-world implementation.
-
Connects Disparate Research Threads: This work serves as an excellent bridge between two very active but often separate areas of architecture research: PIM for LLM inference (e.g., AttAcc, NeuPIMs) and PIM for graph/search algorithms (e.g., RecNMP). By building a system that requires both, HeterRAG effectively synthesizes techniques from both domains, demonstrating how they can be complementary components in a larger system. The authors explicitly acknowledge their debt to prior work (Section 4.2 and 4.3), which is commendable.
Weaknesses
While the core idea is strong, the work could be better contextualized and its limitations more thoroughly explored.
-
The Evolving Nature of RAG: The proposed architecture is tightly coupled to the current dominant RAG paradigm: graph-based ANNS for retrieval followed by autoregressive transformer decoding. However, the RAG space is evolving rapidly. Future techniques might involve different search indexes, simultaneous retrieval and generation, or non-autoregressive models. The paper could benefit from a discussion on the architecture's adaptability to these potential algorithmic shifts. How much of the proposed hardware is special-purpose versus programmable?
-
Understated Role of the Interconnect: The paper uses CXL as the interconnect fabric, which is a sensible choice. However, as the system scales up with many AccelDIMM and AccelHBM units, the all-to-some communication pattern (where retrieved results from many AccelDIMMs are gathered by the host and sent to AccelHBMs) could become a bottleneck. The analysis assumes this overhead is minimal, but a more detailed projection of interconnect traffic under high load would strengthen the scalability claims made in Section 5.4 (page 11).
-
Generalization Claims: In the discussion (Section 4.6, page 9), the authors suggest HeterRAG is well-suited for other workloads like graph processing and recommendation systems. While this is conceptually plausible, the paper is squarely focused on RAG. These claims, while interesting, are speculative without supporting data and might be better framed as promising avenues for future work.
Questions to Address In Rebuttal
-
The fine-grained parallel pipeline optimization (Section 4.4, page 9) is an elegant way to hide latency by overlapping retrieval and generation. Could you provide a more quantitative analysis of its benefit? For instance, what is the typical distribution of retrieval completion times within a batch for your test workloads, and how much of the potential idle time does this pipelining strategy effectively reclaim?
-
The architectural design makes a strong commitment to graph-based ANNS for retrieval. How would the AccelDIMM design need to change to support other popular retrieval methods, such as those based on inverted file indexes (IVF) or product quantization (PQ)? Is the core idea of a DIMM-PIM/HBM-PIM split robust to changes in the underlying retrieval algorithm?
-
Could the authors elaborate on the data path for the retrieved results? The text describes the host aggregating results from AccelDIMMs, mapping vector IDs to documents, and then sending tensors to AccelHBMs. At scale, could this "host-as-router" model become a performance or bandwidth bottleneck? Have you considered a more direct data path between the DIMM and HBM subsystems for certain RAG variants?
Review 3
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents HeterRAG, a heterogeneous Processing-in-Memory (PIM) system designed to accelerate Retrieval-Augmented Generation (RAG) workloads. The core architectural proposal is to combine two distinct types of PIM: a high-capacity, DIMM-based PIM (AccelDIMM) for the retrieval stage (specifically, graph-based ANNS) and a high-bandwidth, HBM-based PIM (AccelHBM) for the generation stage (LLM inference). The authors claim this heterogeneous approach overcomes the capacity/cost limitations of HBM-only systems and the bandwidth limitations of DIMM-only systems. The architecture is supported by three software-hardware co-optimizations: locality-aware retrieval (caching), locality-aware generation (a hardware-accelerated KV cache scheme based on prefix trees), and a fine-grained parallel pipeline to overlap the two stages.
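To make the prefix-tree KV cache idea concrete for readers unfamiliar with it, the sketch below shows only the prefix-reuse half in plain Python: cached KV entries are reused for the longest previously seen token prefix and the remaining tokens are recomputed. The data structures and placeholder values are illustrative assumptions; the selective-recomputation and token-filtering steps discussed later in this review are not modeled.

```python
# Minimal sketch of prefix-tree (trie) KV-cache reuse. Illustrative only,
# not the paper's exact mechanism.

class TrieNode:
    def __init__(self):
        self.children = {}   # token id -> TrieNode
        self.kv = None       # cached KV entry for the prefix ending here (placeholder)

class PrefixKVCache:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens, kv_entries):
        node = self.root
        for tok, kv in zip(tokens, kv_entries):
            node = node.children.setdefault(tok, TrieNode())
            node.kv = kv

    def longest_prefix(self, tokens):
        """Return (reused KV entries, number of tokens still to compute)."""
        node, reused = self.root, []
        for tok in tokens:
            if tok not in node.children:
                break
            node = node.children[tok]
            reused.append(node.kv)
        return reused, len(tokens) - len(reused)

cache = PrefixKVCache()
cache.insert([1, 2, 3, 4], ["kv1", "kv2", "kv3", "kv4"])
reused, to_compute = cache.longest_prefix([1, 2, 3, 9, 9])
print(len(reused), "KV entries reused,", to_compute, "tokens recomputed")
```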
Strengths
From a novelty perspective, the paper's strengths lie not in its high-level architectural concept, but in its specific, system-level co-design choices that are tightly coupled to the RAG workload.
- Hardware Acceleration of a Recent Software Technique: The most novel contribution is the "locality-aware generation" mechanism (Section 4.4, page 8). The idea of combining prefix trees with selective recomputation for KV cache management is itself very recent, with the authors citing a 2025 paper [87] (CacheBlend). The design of dedicated hardware units (Tree Search Unit, KV Substitution Unit, Token Filtering Unit shown in Figure 9) to accelerate this specific software technique is a genuinely new hardware-software co-design contribution.
- Nuanced Pipelining: The "fine-grained parallel pipeline" (Section 4.4, page 9) demonstrates a more sophisticated approach than simple coarse-grained overlapping of retrieval and generation. The proposed mechanism of periodically aggregating partial retrieval results and forwarding high-confidence candidates to the generation stage is a clever system-level optimization that exploits the known behavior of best-first graph search algorithms. This is a non-obvious and specific contribution.
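The mechanism described above can be sketched as follows. The aggregation interval and the "high-confidence" rule used here (a fixed margin between an already-expanded candidate and the best unexplored distance in a best-first search) are hypothetical stand-ins, since the paper does not specify either; the sketch only shows where those two policy decisions sit in the control flow.

```python
# Schematic of interval-based, confidence-aware early forwarding during a
# best-first (greedy) graph search. Interval and margin are placeholders.

import heapq

def search_with_early_forwarding(distances, interval=4, margin=0.2):
    """distances: {vertex_id: distance to the query}, precomputed for brevity."""
    frontier = [(d, v) for v, d in distances.items()]
    heapq.heapify(frontier)
    best, forwarded, step = [], set(), 0
    while frontier:
        d, v = heapq.heappop(frontier)      # expand the closest candidate first
        best.append((d, v))
        step += 1
        if step % interval == 0 and frontier:
            frontier_best = frontier[0][0]  # best distance still unexplored
            for dist, vid in best:
                # "High confidence": already clearly closer than anything left.
                if vid not in forwarded and dist < frontier_best - margin:
                    forwarded.add(vid)      # would be sent to generation early
    return forwarded, [v for _, v in sorted(best)]

fwd, final = search_with_early_forwarding(
    {"a": 0.1, "b": 0.15, "c": 0.6, "d": 0.8, "e": 0.9})
print("forwarded early:", fwd, "final order:", final)
```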
Weaknesses
My primary concern with this paper is the overstatement of novelty regarding its core architectural framework. While the integration is complex and the results are compelling, the fundamental ideas are largely derivative of prior art.
-
The Heterogeneous PIM Concept is Not Fundamentally New: The central claim of novelty is a heterogeneous PIM system. However, this is a logical, if not obvious, application of existing principles. The RAG workload is cleanly divisible into two phases with opposing memory requirements: retrieval (high capacity, random access) and generation (high bandwidth, streaming access). Mapping these to DIMM-based PIM and HBM-based PIM, respectively, is a natural system design choice rather than a groundbreaking architectural innovation. The concept of heterogeneous computing and memory systems is well-established.
-
Component PIM Architectures Are Based Heavily on Prior Work:
- PIM for Retrieval: The use of DIMM-based PIM for accelerating ANNS has been explored. The related work section itself cites MemANNS [15] and DRIM-ANN [14], which use commercial DIMM-PIM for this task. The design of AccelDIMM (Section 4.2, page 6) is an engineering contribution that adopts established techniques like rank-level processing and instruction compression from prior works such as RecNMP [39] and TRiM [65]. The novelty delta here is incremental.
- PIM for Generation: Similarly, accelerating transformer inference with HBM-based PIM is a very active area of research. The paper explicitly states that AccelHBM (Section 4.3, page 7) "adopt[s] the same mapping scheme as AttAcc [64]" and draws inspiration from "Newton [25]". Therefore, the novelty of AccelHBM itself is minimal; it is an application of known techniques.
-
"Locality-Aware Retrieval" is Standard Practice: The first co-optimization, "locality-aware retrieval" (Section 4.4, page 8), is described as caching frequently accessed vertex vectors and reusing search results as starting points in iterative queries. These are standard caching and heuristic optimization techniques, respectively. While applying them is necessary for a high-performance system, it does not constitute a novel research contribution.
In essence, the paper combines two known PIM acceleration strategies (PIM for ANNS and PIM for Transformers) into a single system. The novelty is in the integration and the two more advanced co-optimizations, not in the headline architectural concept itself.
Questions to Address In Rebuttal
The authors should use the rebuttal to clarify and defend the precise novelty of their contributions.
- The paper presents the heterogeneous HBM+DIMM architecture as its primary contribution. Given that the components (AccelHBM, AccelDIMM) are heavily based on prior art ([64], [39], [65], etc.), could the authors precisely articulate the novel architectural insight beyond the mapping of RAG stages to suitable memory technologies? What is the fundamental architectural challenge in this integration that this work is the first to solve?
- Regarding the fine-grained pipeline (Section 4.4, page 9), prior work like PipeRAG [34] also proposes aggressive overlapping of retrieval and generation. Please clarify the key difference and novel step that your interval-based, confidence-aware aggregation mechanism provides over such prior art.
- Could the authors re-frame their primary contribution? Is it the heterogeneous architecture itself, or is the main contribution the hardware-software co-design for locality-aware generation (accelerating [87]) and the fine-grained pipeline, which are enabled by a heterogeneous architecture? Clarifying this would help position the work more accurately within the literature.
ATiM: Autotuning Tensor Programs for Processing-in-DRAM
Abstract
Processing-in-DRAM (DRAM-PIM) has emerged as a promising technology for accelerating memory-intensive operations in modern applications, such as Large Language Models (LLMs). Despite its potential, current software stacks for DRAM-PIM face significant ...
Reviews
Review 1
Reviewer: The Guardian
Summary
The authors present ATiM, a tensor compiler framework designed to autotune and generate code for Processing-in-DRAM (PIM) systems, specifically targeting the UPMEM architecture. The core contributions are threefold: 1) a unified search space that jointly optimizes host-side data distribution and kernel-side loop transformations; 2) a set of PIM-aware compiler optimizations, most notably for eliminating boundary check overheads; and 3) enhancements to the evolutionary search algorithm to better handle the expanded search space. The paper claims significant performance improvements over hand-tuned libraries (PrIM) and other baselines on both microbenchmarks and layers from the GPT-J model, evaluated on real UPMEM hardware.
Strengths
- End-to-End System: The paper presents a complete, functional system that bridges high-level tensor abstractions down to executable code for a real, commercial PIM architecture. This represents a substantial engineering effort.
- Real Hardware Evaluation: The primary performance evaluations are conducted on a physical UPMEM server (Section 6), which lends significant credibility to the reported latency numbers, as opposed to relying solely on simulation.
- Well-Motivated Optimizations: The PIM-aware optimizations detailed in Section 5.3, particularly the analysis of boundary check elimination (Figure 4, Page 4), are technically sound and address a well-understood performance bottleneck on simple in-order cores like the UPMEM DPU.
Weaknesses
My primary concerns with this work center on the fairness of the experimental comparisons and the potential conflation of contributing factors, which may lead to an overstatement of the proposed system's core contributions.
-
Baseline Unfairness and the Source of Performance Gains: The headline performance gains (e.g., 6.18x for MTV, 8.21x for GPT-J layers) appear to stem not from a superior compiler per se, but from comparing a system capable of a powerful optimization (2D tiling with hierarchical reduction) against baselines that are artificially constrained. The authors themselves identify this as the key differentiator in Section 7.1 (Page 10): "By applying 2D tiling on both spatial and reduction loop dimensions, ATiM generates a sufficient number of smaller tiles..." The PrIM and even the authors' own "PrIM+search" baselines are limited to 1D tiling. This is not a like-for-like comparison of autotuning frameworks; it is a demonstration of the known benefits of a specific tiling strategy (a back-of-the-envelope illustration follows this list). An expert programmer could implement 2D tiling manually. Therefore, the comparison does not isolate the benefit of ATiM's autotuner from the benefit of a more advanced tiling strategy that the baselines were not configured to use. The work is effectively comparing two different classes of algorithms, which inflates the perceived contribution of the compiler framework itself.
-
Conflation of Search Space and Search Algorithm: The paper introduces both an expanded, joint search space (Section 5.2.1) and a modified search algorithm ("Balanced Evolutionary Search," Section 5.2.3). The results in Section 7.4 (Figure 14, Page 12) attempt to justify the new algorithm but do so by comparing ATiM's full solution against standalone components. A crucial piece of analysis is missing: a clear decoupling of the gains. The performance improvement could primarily come from the richer search space, with the algorithm providing only marginal benefit, or vice versa. Without an experiment that applies a baseline search algorithm (e.g., default TVM) to the new joint search space, it is impossible to attribute the performance gains correctly between these two distinct contributions.
-
Inconsistent Evaluation Methodology: For the main performance results in Section 7.1 and 7.2, the authors use real hardware. However, to evaluate the impact of their PIM-aware optimizations in Section 7.3 (Figure 13, Page 12), they switch to the uPIMulator simulator. This switch is not justified, nor is the simulator validated against the real hardware used elsewhere. Simulators can fail to accurately model memory access contention, DMA overheads, and other microarchitectural effects. Presenting critical performance breakdown data from an unvalidated simulator undermines the conclusions drawn about the real-world impact of these specific optimizations.
-
Unsupported Claims of Generality: The paper focuses exclusively on the UPMEM architecture. While the Discussion (Section 8, Page 12) speculates on extending ATiM to other architectures like HBM-PIM, these claims are entirely unsubstantiated. The current implementation, especially the lowering passes for host/kernel communication and DPU binding, is tightly coupled to UPMEM's programming model. The work as presented does not provide the necessary abstractions or evidence to support its portability.
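To make the load-balance argument in the first weakness concrete (as referenced there), the following back-of-the-envelope sketch counts the independent tiles produced by 1D versus 2D tiling of a matrix-vector product. The matrix dimensions, tile sizes, and DPU count are illustrative assumptions, not values from the paper; the point is only that also splitting the reduction dimension multiplies the tile count, which is what keeps thousands of DPUs busy independently of any autotuner.

```python
# Tile-count comparison for an M x K matrix-vector product distributed
# over many PIM cores. Sizes are assumptions, not paper measurements.

import math

def tile_counts(M, K, tile_m, tile_k=None):
    """Number of independent work units produced by the tiling."""
    m_tiles = math.ceil(M / tile_m)
    if tile_k is None:                       # 1D: only the spatial (row) dim is split
        return m_tiles
    return m_tiles * math.ceil(K / tile_k)   # 2D: row tiles x reduction slices

M, K, num_dpus = 8192, 8192, 2048
one_d = tile_counts(M, K, tile_m=16)                 # 512 tiles
two_d = tile_counts(M, K, tile_m=16, tile_k=1024)    # 512 * 8 = 4096 tiles

print(f"1D tiling: {one_d} tiles for {num_dpus} DPUs "
      f"({one_d / num_dpus:.2f} per DPU, so most DPUs sit idle)")
print(f"2D tiling: {two_d} tiles ({two_d / num_dpus:.2f} per DPU), "
      "at the cost of a cross-tile reduction of partial sums on the host")
```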
Questions to Address In Rebuttal
-
Regarding the main performance claims in Section 7.1: Can you justify the decision to limit the PrIM+search baseline to 1D tiling? To demonstrate the value of your autotuner, would not a fairer comparison be against a PrIM kernel that is manually optimized by an expert using the same 2D tiling and hierarchical reduction strategy that ATiM discovers?
-
To decouple the contributions of your search space and search algorithm, please provide an ablation study. Specifically, what is the performance of a system using the default TVM evolutionary search on ATiM's proposed joint host-kernel search space? This would isolate the benefit derived purely from the expanded space.
-
Please justify the switch to uPIMulator for the analysis in Section 7.3. Can you provide any data correlating the performance characteristics (e.g., memory stall cycles, instruction mix) reported by the simulator with performance counters or observed behavior on the real UPMEM hardware used in Section 7.1?
-
The paper claims to establish a "foundation for advancing DRAM-PIM programmability" (Abstract, Page 1). Given that the current implementation is specific to UPMEM, what concrete, implemented abstractions in ATiM's design would facilitate porting it to a fundamentally different PIM architecture, such as Samsung's HBM2-PIM or SK Hynix's GDDR6-AiM, which use different execution and data mapping models?
Review 2
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents ATiM, a search-based, optimizing tensor compiler for Processing-in-DRAM (PIM) systems, specifically targeting the commercial UPMEM architecture. The work's essential contribution is bridging the gap between high-level tensor abstractions and the complex, low-level realities of PIM programming. It achieves this by extending the Apache TVM compiler framework to automate the co-optimization of both host and PIM-kernel code. Key innovations include (1) defining and exploring a joint search space that unifies host-side data distribution strategies with kernel-side loop transformations, (2) introducing PIM-aware compiler optimizations to mitigate hardware-specific bottlenecks like boundary checks, and (3) refining the evolutionary search algorithm to effectively navigate this expanded and complex optimization space. The authors demonstrate that this automated approach can generate code that significantly outperforms highly-optimized, hand-tuned libraries on a variety of tensor operations and layers from the GPT-J model.
Strengths
-
Addresses a Critical Problem: The primary obstacle to the widespread adoption of novel accelerators like PIM is not a lack of hardware potential, but the immense difficulty of programming them effectively. This paper directly confronts this programmability and performance portability crisis. By providing a fully automated path from a high-level tensor operation to optimized PIM code, ATiM represents a significant step towards making PIM a viable and accessible architectural player, rather than a niche curiosity.
-
Elegant Conceptual Framing: The central insight—that host and kernel code for PIM must be optimized jointly—is both correct and critical. Unlike a GPU where a powerful runtime and hardware scheduler abstracts away many data placement issues, PIM performance is deeply coupled to how the host partitions and distributes data across thousands of simple processing units (DPUs). The paper’s approach of repurposing TVM's schedule primitives to represent this joint search space (Table 2, page 6) is an elegant and powerful way to frame this complex co-design problem within a proven compiler paradigm.
-
Strong Connection to Broader Compiler Trends: This work fits perfectly within the modern compiler philosophy championed by systems like Halide and TVM, which advocate for separating the algorithmic specification ("what") from the performance schedule ("how"). ATiM successfully demonstrates that this philosophy is not only applicable but essential for taming the complexity of PIM. It serves as an excellent case study on how these domain-specific, search-based compilation techniques can be adapted to unlock the potential of new and unconventional hardware.
-
High-Impact Empirical Validation: The performance gains are substantial and compelling. Outperforming hand-tuned, vendor-adjacent libraries (like the PrIM benchmarks) by factors of up to 8.21x on real-world LLM kernels (Section 7.2, page 11) is a powerful statement. It validates the core hypothesis that a systematic, automated search can discover non-obvious optimizations that even human experts might miss, especially given the vast and interdependent parameter space.
Weaknesses
From the perspective of contextualizing the work's long-term impact, the weaknesses are less about flaws and more about opportunities for broader framing:
-
Hardware Specificity: The work is, by necessity, tightly coupled to the UPMEM architecture. While the discussion (Section 8, page 12) mentions extensibility to other PIM designs (e.g., HBM-PIM), the paper would be strengthened by a more explicit discussion of which principles are fundamental to PIM in general versus which are artifacts of UPMEM. For example, the joint host-kernel search is a universal PIM problem. However, the specific PIM-aware optimizations for boundary checks (Section 5.3, pages 7-8) are a direct consequence of UPMEM's simple, in-order RISC cores. A clearer separation would help position ATiM as a foundational framework for PIM compilation, not just a UPMEM compiler.
-
The Cost of Automation: The paper rightly focuses on the performance of the generated code, but the cost of the autotuning process itself is a crucial practical barrier. The discussion section briefly notes the overhead is higher than for CPUs. This is a key finding. Quantifying this trade-off more formally (e.g., plotting performance gain vs. tuning time) would provide a more complete picture for practitioners and would situate the work in the broader context of "online" vs. "offline" compilation strategies for ML models.
Questions to Address In Rebuttal
-
The joint search space is the paper's most significant conceptual contribution. Can the authors provide a concrete example of a counter-intuitive trade-off discovered by ATiM? For instance, was there a case where the autotuner selected a less-efficient kernel configuration because it enabled a dramatically better host-side data distribution or reduction strategy, a solution a human programmer focused on kernel optimization might overlook?
-
The PIM-aware optimizations targeting boundary checks (Section 5.3) are fascinating, as they address the limitations of simple in-order cores. These architectural constraints are not unique to PIM; they are also common in resource-constrained hardware like embedded CPUs and edge AI accelerators. Could the authors comment on the potential for generalizing these specific tensor-level branch-hoisting and loop-tightening techniques beyond the PIM domain?
-
Looking forward, how would ATiM's fundamental abstractions need to evolve to support PIM architectures with different compute primitives, such as the fixed-function MAC units in Samsung's HBM-PIM or SK Hynix's GDDR6-AiM? Would the existing TVM schedule primitives for tiling and caching be sufficient, perhaps mapped to new semantics via the lowering process, or would this fundamentally different hardware model necessitate new high-level primitives in the search space?
Review 3
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper presents ATiM, a tensor compiler designed to generate optimized code for the UPMEM Processing-in-DRAM (PIM) architecture. The authors identify that existing PIM software stacks lack high-level abstractions and systematic optimization frameworks. ATiM's core proposal is to create a unified, search-based autotuning framework that co-optimizes host-level decisions (data distribution across PIM cores) and kernel-level loop transformations simultaneously.
To achieve this, the authors make three primary claims of novelty:
1. The formulation and exploration of a joint search space for host and kernel programs, enabled by repurposing the schedule primitives of the TVM tensor compiler.
2. A set of "PIM-aware" compiler optimizations at the tensor IR level to eliminate performance bottlenecks specific to simple in-order PIM cores, particularly redundant boundary checks.
3. An enhancement to the evolutionary search algorithm to counteract a sampling bias inherent to the expanded PIM search space.
The paper demonstrates that this approach can generate code that outperforms hand-tuned libraries for UPMEM. My review will focus exclusively on the novelty of these contributions relative to prior art.
Strengths
The primary conceptual novelty of this work is the elegant formulation of the joint host-kernel optimization problem. While autotuning for heterogeneous systems is a well-established field, the UPMEM architecture presents a unique challenge where the host's data distribution strategy is not merely a data-copying prelude but fundamentally defines the work performed by each kernel. The key insight and novel mechanism is the repurposing of TVM's existing schedule primitives to describe this coupled space (Section 5.2.1, page 5). Using primitives like split, reorder, and bind—traditionally used for loop transformations—to also define inter-DPU data tiling and mapping is a clever and non-obvious abstraction. It avoids the need to invent an entirely new scheduling language and instead extends the semantics of a known one. This is the paper's strongest and most original contribution.
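The distinction between tuning the two sides in isolation and searching the joint space can be shown schematically. The parameter names and value ranges below are invented for illustration and are not ATiM's actual schedule primitives or Table 2 entries; the sketch only conveys why the joint formulation changes the size and shape of the search problem.

```python
# Conceptual illustration of a joint host/kernel search space versus tuning
# each side in isolation. All names and ranges are invented placeholders.

from itertools import product

host_choices = {
    "rows_per_dpu":    [16, 32, 64, 128],   # host-side data distribution
    "reduction_split": [1, 2, 4],           # how many DPUs share one output row
}
kernel_choices = {
    "inner_tile": [8, 16, 32],              # kernel-side loop tiling
    "unroll":     [1, 2, 4],
}

def configs(space):
    keys = list(space)
    return [dict(zip(keys, vals)) for vals in product(*space.values())]

joint = [dict(**h, **k) for h, k in product(configs(host_choices),
                                            configs(kernel_choices))]
print(f"host-only space:   {len(configs(host_choices))} points")
print(f"kernel-only space: {len(configs(kernel_choices))} points")
print(f"joint space:       {len(joint)} points")
```

The practical point is that host-level choices constrain which kernel-level points are legal or profitable, which is why the two sub-spaces cannot simply be tuned one after the other.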
The second area of notable novelty is in the PIM-aware optimizations (Section 5.3, pages 7-8). While the individual techniques (boundary check elimination, loop-bound tightening, invariant code motion) are known compiler concepts, their application at the TensorIR level is novel and well-justified. The authors convincingly argue that the high-level semantic guarantees of the tensor IR (e.g., knowledge of consumer operations from compute_at) enable transformations that are unsafe or intractable for a low-level compiler. The specific combination of invariant branch hoisting with what is effectively partial dead code elimination (Section 5.3.3) is a particularly strong example of exploiting high-level semantics for an aggressive optimization that a general-purpose compiler would likely avoid.
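For readers outside the compiler community, the class of transformation at issue can be illustrated without any TensorIR machinery. The sketch below contrasts a tiled loop that pays a per-element bounds check with one that splits off a remainder tile so that full tiles run check-free; it mirrors the general idea of boundary-check elimination and loop-bound tightening, not ATiM's exact passes.

```python
# Generic illustration of per-element boundary checks in tiled loops and the
# effect of splitting off a remainder tile. Not ATiM's actual TensorIR passes.

def sum_tiled_with_checks(x, tile=8):
    """Naive tiling: every inner iteration pays a bounds check."""
    n, total, checks = len(x), 0, 0
    for start in range(0, n, tile):
        for i in range(start, start + tile):
            checks += 1                    # the per-element branch
            if i < n:
                total += x[i]
    return total, checks

def sum_tiled_split_remainder(x, tile=8):
    """Full tiles run check-free; only the remainder tile is guarded."""
    n, total, checks = len(x), 0, 0
    full_end = (n // tile) * tile
    for start in range(0, full_end, tile):
        for i in range(start, start + tile):   # bounds known statically
            total += x[i]
    for i in range(full_end, n):               # short guarded remainder loop
        checks += 1
        total += x[i]
    return total, checks

data = list(range(30))
print(sum_tiled_with_checks(data))       # (435, 32): 32 guarded iterations
print(sum_tiled_split_remainder(data))   # (435, 6):  only the tail is guarded
```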
Weaknesses
While the core ideas are strong, the novelty of some components is incremental when deconstructed.
-
Joint Optimization as a Concept: The overarching idea of co-optimizing host and device code is not fundamentally new. Frameworks for heterogeneous computing have long grappled with decisions about data movement, tiling for memory hierarchies, and kernel scheduling. The novelty here is not the goal of joint optimization, but the specific formulation for the tightly-coupled PIM domain. The paper's contribution should be framed as a novel mechanism for a known problem, rather than the identification of the problem itself.
-
Search Algorithm Enhancements: The "improved search algorithms" described in Section 5.2.3 (page 7) consist of applying balanced sampling and an adaptive epsilon-greedy strategy. These are standard, well-known techniques from the fields of machine learning and search heuristics (a minimal sketch follows this list). The contribution is identifying a specific sampling bias (the "rfactor primitive bias") within their framework and applying an off-the-shelf solution. This is a sound engineering improvement necessary to make their system work well, but it does not represent a novel contribution to the field of search algorithms.
-
Marginal Benefit of Some Optimizations: The PIM-aware optimizations, while novel in their application context, must be weighed against their complexity. The experimental results in Figure 12 (page 12) show that the DMA-aware elimination provides the vast majority of the benefit. The subsequent Loop-bound tightening (LT) and Invariant branch hoisting (BH) provide smaller, though still positive, gains (often in the 5-15% range). The implementation complexity of these passes, especially the logic for hoisting combined with PDCE, may be substantial. For a novel technique to be significant, it should ideally provide a more transformative benefit. The case is made that these small gains matter on resource-constrained cores, but the contribution feels more incremental than foundational.
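For context on the point about standard techniques (referenced in the second weakness above), an adaptive epsilon-greedy selector amounts to a few lines of code; the decay schedule and candidate scores below are placeholders and nothing here reflects ATiM's specific implementation.

```python
# Minimal adaptive epsilon-greedy selection; a generic sketch, not ATiM's code.

import random

def pick(candidates, scores, epsilon):
    """Explore with probability epsilon, otherwise exploit the best-scored candidate."""
    if random.random() < epsilon:
        return random.choice(candidates)
    return max(candidates, key=lambda c: scores[c])

scores = {"cfg_a": 0.9, "cfg_b": 0.5, "cfg_c": 0.1}   # placeholder fitness values
epsilon = 0.5
for step in range(5):
    choice = pick(list(scores), scores, epsilon)
    epsilon *= 0.8        # adaptive part: decay exploration as the search matures
    print(step, choice, round(epsilon, 3))
```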
Questions to Address In Rebuttal
-
On Repurposing Primitives: The core claim rests on repurposing TVM primitives. Prior work in TVM and other tensor compilers already uses scheduling primitives to control data layout, tiling, and memory scope (e.g., mapping to shared vs. global memory in GPUs). Could the authors clarify the fundamental conceptual difference between mapping a tensor tile to a GPU thread block's shared memory (a standard practice) and mapping a tensor tile to a DPU's MRAM (the proposed technique)? Is the novelty simply the target (a PIM DPU) or is there a deeper semantic distinction in how the primitives are being interpreted for the host-level code generation?
-
On the Novelty of High-Level Optimization: The paper argues that optimizations like invariant branch hoisting with PDCE are enabled by the high-level semantics of TensorIR. However, polyhedral compilation frameworks (e.g., Polly [20], Pluto) also operate on high-level loop nest representations with perfect dependence information. Could a state-of-the-art polyhedral framework, when applied to the same problem, not derive an identical or functionally equivalent code transformation? What, specifically, does TensorIR enable that a polyhedral representation does not in this context?
-
On Generalizability of the Search Bias: The solution in Section 5.2.3 addresses the "rfactor primitive bias." Is this bias a fundamental property of the PIM optimization search space, or is it an artifact of the specific evolutionary search algorithm implemented in TVM/Ansor? If it is the latter, then the contribution is more of a patch for a specific framework's limitation rather than a novel, generalizable solution for PIM autotuning. Please clarify.
Single Spike Artificial Neural Networks
Abstract
Spiking neural networks (SNNs) circumvent the need for large scale arithmetic using techniques inspired by biology. However, SNNs are designed with fundamentally different algorithms from ANNs, which have benefited from a rich history of theoretical ...
Reviews
Review 1
Review Form: Single Spike Artificial Neural Networks
Reviewer Persona: The Guardian (Adversarial Skeptic)
Summary
The authors present a computing paradigm, "Single Spike Artificial Neural Networks," that attempts to merge traditional ANNs with the temporal dynamics of SNNs. The core concept is to represent ANN weights as programmable delay elements and activations as the arrival time of a single digital pulse ("spike"). The standard multiply-accumulate operation is mapped to a temporal sequence of delay (for multiplication) and a negative log sum exponential (nLSE) approximation (for accumulation). The authors propose architectural innovations, including a hybrid temporal/digital systolic array, an improved nLSE approximation circuit, and the integration of emerging temporal memories to claim significant improvements in energy efficiency (up to 3.5x over 8-bit digital) and latency (up to 4x over SNNs) on the MLPerf Tiny benchmark suite.
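For reference, the transformation this summary describes can be stated compactly. Writing the temporal code of a value $x$ as $t_x = -\ln x$ (the convention implied by the paper's Table 1; the notation here is ours):

$$ x \cdot w \;\longmapsto\; -\ln(xw) = t_x + t_w, $$

so multiplication becomes a fixed programmable delay, while

$$ x + y \;\longmapsto\; -\ln(x + y) = -\ln\!\left(e^{-t_x} + e^{-t_y}\right) = \mathrm{nLSE}(t_x, t_y), $$

so accumulation requires the nLSE primitive whose approximation the rest of this review scrutinizes.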
While the underlying mathematical transformation is coherent, the paper's claims rest on a foundation of brittle approximations, optimistic hardware assumptions, and an incomplete analysis of scalability and robustness. The work mistakes a clever but highly-constrained proof-of-concept for a generally applicable and robust solution.
Strengths
-
Consistent Mathematical Framework: The paper's primary strength is the formal mapping of standard ANN arithmetic into "delay space" (Table 1, page 2). The transformation of multiplication into addition (delay) and addition into nLSE is mathematically sound and provides a clear theoretical basis for the work.
-
Identification of Approximation Flaws: The authors correctly identify that prior nLSE approximations are insufficient and introduce bias. The investigation into a complementary "inverse approximation" to improve accuracy (Section 2.2, Figures 3 and 4) demonstrates a rigorous attempt to address a core weakness of the underlying approach.
-
Insightful Analysis of Data Distribution Impact: The analysis in Figure 10 (page 10) is a valuable contribution, as it clearly demonstrates why the proposed method fails on certain benchmarks. Linking the PE tree structure to the distribution of nLSE input differences provides a concrete explanation for the performance degradation on the Anomaly Detection task, revealing the method's sensitivity to network topology and data statistics.
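The dependence on input differences follows from a standard identity that is worth stating explicitly, since it underpins the Figure 10 analysis referenced above:

$$ \mathrm{nLSE}(a, b) = -\ln\!\left(e^{-a} + e^{-b}\right) = \min(a, b) - \ln\!\left(1 + e^{-|a - b|}\right). $$

When the two arrival times are far apart, the correction term vanishes and nLSE degenerates to a min; when they are close, the full correction (up to $\ln 2$) is needed. The distribution of pairwise input differences at each level of the PE tree therefore determines how much error any truncated approximation introduces.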
Weaknesses
My primary concerns with this work are the fragility of the core computational primitive, an over-reliance on non-standard hardware to make its most significant claims, and an unconvincing treatment of noise and scalability.
-
The Core Approximation Lacks Generalizability and Robustness: The entire system is built upon the nLSE approximation, which is shown to be fundamentally brittle.
- Benchmark-Specific Failure: The paper's own results show a catastrophic failure on the Anomaly Detection benchmark, where accuracy drops from an original of ~0.9 to as low as 0.46 with a balanced PE tree (Table 2, page 9). The authors are forced to advocate for a specific, unbalanced 9-tree configuration to recover performance. This is not a sign of a "generally applicable" architecture; it is evidence of a highly-tuned method that is not portable across different network structures or problem domains.
- Extreme Sensitivity to Noise: The noise analysis in Figure 9 (page 10) is alarming. While the authors frame the voltage variation model as a "worst case scenario," a >15% drop in accuracy with a minor voltage swing indicates a system with razor-thin margins. Real-world systems experience power supply noise, and this level of sensitivity makes the practical deployment of this architecture highly questionable.
-
Headline Claims Hinge on Idealized Temporal Memory: The paper's most compelling claim—a 3.5x energy improvement over digital systolic arrays—is entirely dependent on the use of a memristive temporal memory (TM) system (Section 3.3, page 7). The digital memory (DM) version of their architecture (DS DM) shows performance that is, at best, on par with the digital baseline (Figure 12, page 11).
- This is a critical flaw in the paper's narrative. The authors are not comparing their novel compute paradigm to the state-of-the-art; they are comparing a system that combines their paradigm with a non-standard, emerging memory technology against a standard-memory baseline. Memristive devices suffer from well-known issues of process variability, limited write endurance, and precision degradation, none of which are adequately modeled or discussed. The claim of "7 bits of precision" (Section 3.3) is optimistic and ignores the significant overhead of the required analog control and readout circuitry.
-
Unconvincing Scalability and Cost Analysis: The paper glosses over fundamental scaling challenges that would arise in any practical implementation.
- Broadcast and Noise: The proposed broadcast of input signals from a central DTC (Section 3.1, page 5) is a known bottleneck. Figure 13 (page 11) shows the worst-case error growing with array width. The authors dismiss this by stating it is "well under the timing margin," but this linear trend suggests that for larger, more realistic array sizes required for non-trivial networks, this noise will become a dominant error source. The analysis is insufficient to prove scalability.
- Hidden Cost of Temporal Recurrence: The mechanism for temporal reuse via resynchronization (Section 3, page 5) is described as delaying a signal by a full cycle time T. This is a non-trivial operation. The claim that this delay can simply be "combined with the delay inherent to the nLSE approximation" is convenient but unsubstantiated. What happens if the nLSE computational path is very short? A long, power-hungry delay line would be required, the cost of which is not included in the energy analysis.
Questions to Address in Rebuttal
-
Given the dramatic failure of your method on the Anomaly Detection task with standard balanced tree structures, can you formally define the class of neural networks or data distributions for which your approximation is valid? How would a designer know a priori if your architecture is suitable for their model?
-
Please provide a more honest comparison against the digital baseline by focusing on the "DS DM" results. Can you justify the value of your approach when, using standard digital memories, its energy consumption is comparable to a conventional digital accelerator? Alternatively, can you provide data from fabricated temporal memory arrays that substantiate your claims of 7-bit precision and low energy, including all control and peripheral circuit overhead?
-
The "improved approximation" relies on a "temporal average" circuit (Section 2.2, page 4), which you state is achieved via
nLSE(x, y) + ln 2. This is an exact mathematical operation. Please provide the circuit-level implementation of this and analyze its area, energy, and susceptibility to the very noise you show is detrimental in Figure 9. How "approximate" is this circuit in reality? -
Please provide a quantitative analysis of the area and energy cost of the cycle-time delay T required for temporal recurrence. How often is the inherent computational delay insufficient, thereby requiring an explicit, long delay chain? How does this unaccounted-for cost impact your overall energy claims?
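For the record, the identity presumably behind the nLSE(x, y) + ln 2 claim referenced above is

$$ \mathrm{nLSE}(t_x, t_y) + \ln 2 = -\ln\!\left(e^{-t_x} + e^{-t_y}\right) + \ln 2 = -\ln\frac{x + y}{2}, $$

i.e., the delay-space representation of the arithmetic mean of the two operands. Whether the circuit realizes this exactly, and at what area, energy, and noise cost, is precisely what the question asks the authors to substantiate.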
Review 2
Review Form: Synthesizer Persona
Summary
This paper presents a novel approach for neural network inference, termed "Single Spike Artificial Neural Networks," which compellingly bridges the gap between traditional Artificial Neural Networks (ANNs) and Spiking Neural Networks (SNNs). The core contribution is a mathematical framework and corresponding hardware architecture that executes standard ANN operations in a temporal, or "delay space," domain. By representing numerical values as the arrival time of a single digital pulse (a "spike"), the authors transform the fundamental multiply-accumulate (MAC) operation. Multiplication becomes simple addition of delays, and accumulation is realized through a hardware-efficient approximation of the negative log sum exponential (nLSE) function.
The authors build a complete system concept around this idea, proposing a hybrid temporal/digital systolic array that leverages classical dataflows (e.g., Weight Stationary) while performing computation using temporal primitives. Crucially, their approach can run pre-trained ANNs with minimal modification, thus inheriting the benefits of the mature ANN software and training ecosystem, a major advantage over conventional SNNs. The paper evaluates this approach across the MLPerf Tiny benchmark suite, demonstrating significant potential for energy efficiency gains, particularly when paired with emerging temporal memory technologies, where they project a 3.5x improvement over aggressive 8-bit digital designs.
Strengths
-
Fundamental Novelty and Conceptual Elegance: The paper's primary strength is its core idea. The mapping of ANN arithmetic to a logarithmic time domain (x -> -ln(x)) is both elegant and powerful. It establishes a direct, mathematically sound link between the two dominant paradigms of neural computation. This is not an incremental improvement but a new perspective on how to build neural accelerators, sitting at a fascinating intersection of temporal computing, logarithmic number systems, and traditional computer architecture.
-
Bridging the ANN-SNN Divide: A significant practical strength is the ability to leverage the vast ecosystem of ANN research. SNNs have long been hampered by immature training algorithms and a lack of standardized software. By providing a direct execution path for pre-trained ANNs, this work sidesteps that entire problem, making the energy benefits of spike-based computation immediately accessible. The discussion in Section 1 (page 1) effectively frames this motivation.
-
System-Level Approach: The authors go beyond a mere theoretical concept. They consider the full stack, from the circuit-level implementation of nLSE approximations (Section 2.2, page 4) to a programmable systolic array architecture (Section 3, page 5) and its dataflow (Section 4, page 7). The proposed integration with temporal memories (Section 3.3, page 7) is particularly forward-looking and essential to the claimed energy benefits.
-
Thorough and Realistic Evaluation: The evaluation is comprehensive. The authors analyze the impact of their approximations on accuracy (Section 6.1, page 8), including the effects of hardware noise (Figure 9, page 10) and architectural choices like PE tree size (Table 2, page 9). This detailed analysis lends credibility to their claims and provides valuable insights for future designers. The comparison against both a digital baseline and a state-of-the-art SNN accelerator (SATA) in Section 6.2 (page 10) effectively contextualizes their results.
Weaknesses
While the core idea is strong, the paper could be improved by addressing the following points, which are more about expanding the context and exploring limitations rather than fundamental flaws.
-
Limited Scope of Network Architectures: The evaluation is performed on the MLPerf Tiny suite, which consists of relatively small models. While appropriate for an initial investigation, the scalability of this approach to larger, more complex networks (e.g., Transformers, large CNNs) remains an open question. The dynamic range of activations and weights in such models could pose a significant challenge for the fixed-point temporal quantization scheme. The analysis on broadcast scaling in Section 6.3 (page 11) is a good start, but a broader discussion is needed.
-
Under-explored Connection to Logarithmic Number Systems (LNS): The paper implicitly reinvents many concepts from the well-established field of LNS hardware design. For instance, the difficulty of addition is the central challenge in LNS, just as nLSE is the central challenge here. Explicitly framing the work in the context of LNS could strengthen its theoretical foundation and allow it to draw from decades of research on LNS approximation techniques and error analysis.
-
Complexity of Temporal Synchronization: The paper mentions the need for resynchronization between cycles using "temporal recurrence" (Section 3.1, page 5). While the mechanism is described, the overheads and potential timing closure challenges of managing these precise delays across a large, potentially asynchronous array could be substantial. This feels like a critical implementation challenge that is perhaps understated.
Questions to Address In Rebuttal
-
Dynamic Range and Precision: Your temporal representation relies on a mapping from a numerical value to a physical delay. How does this system handle the wide dynamic range of values seen in larger models? Is the fixed-point quantization scheme (determined by the unit scale and number of bits for the programmable delay) sufficient for models beyond the embedded space, or would a form of temporal "block floating-point" be necessary?
-
The Cost of Conversion: The input data must be converted from the digital domain to the temporal domain (x -> -ln(x)). While this is a one-time cost for weights, it must be done for every input activation. Could you elaborate on the area and energy cost of this initial Digital-to-Time conversion and how it impacts the overall system efficiency, especially for input-bound layers?
-
Beyond ReLU and Max-Pooling: In Section 2.1 (page 3), you demonstrate elegant temporal implementations for ReLU and max-pooling. How would your approach handle other common non-linearities, such as GeLU or Swish, which are prevalent in modern architectures like Transformers? Do these functions have similarly efficient temporal implementations, or would they require costly conversions back to the digital domain?
-
Training in the Loop: You show that regularization during standard training improves robustness (Figure 8, page 9). Have you considered Quantization-Aware Training (QAT) where the specific nLSE approximation and noise models are included in the training loop? It seems this could be a powerful technique to close the remaining accuracy gap and would be a natural next step. Is there any fundamental reason this would not be feasible?
Review 3
Review Form: The Innovator (Novelty Specialist)
Summary
This paper proposes an approach to implement Artificial Neural Networks (ANNs) using temporal computing primitives. The core idea is to transform ANN operations into a logarithmic "delay space," where scalar values are represented by the arrival time of a single digital pulse ("spike"). In this domain, multiplication is implemented as a programmable delay, and accumulation is approximated by a negative log sum exponential (nLSE) function. The authors build a complete system around this concept, proposing a novel noise-tolerant nLSE approximation circuit, a hybrid temporal/digital systolic array architecture to support these operations programmably, and an evaluation of this system when integrated with emerging temporal memories. The work aims to bridge the gap between energy-efficient Spiking Neural Networks (SNNs) and the algorithmically mature world of ANNs.
Strengths
The novelty of this work lies not in the base concept, which has been previously explored, but in the specific architectural and circuit-level contributions required to make such a system viable and robust.
-
Novel Approximation Circuit: The most significant and clearly novel contribution is the improved nLSE approximation method detailed in Section 2.2 (page 4). The technique of creating a complementary "inverse approximation" and then performing a "temporal average" of the two to cancel out logarithmic bias is a clever and specific circuit-level innovation. This directly addresses a fundamental accuracy challenge in this domain and appears to be genuinely new.
-
Novel Hybrid Systolic Architecture: The proposed architecture in Section 3 (page 5) is a novel synthesis of concepts. While the systolic dataflow itself is not new, its implementation in a hybrid temporal/digital domain is. The methods for handling spatial reuse (broadcasting) and temporal reuse (resynchronization of temporal signals, discussed at the end of Section 3.1, page 5) are unique to the constraints of this single-spike computing paradigm and represent a new architectural pattern.
-
Novel System-Level Integration: The paper is the first to propose and evaluate the integration of memristive temporal memories (Section 3.3, page 7) into an end-to-end ANN accelerator. While the temporal memory itself is based on prior work [37], its use in this context to eliminate domain conversion overhead (TDCs/DTCs) within a systolic array is a novel systems-level contribution. This provides a clear application-driven context for a technology that has largely been demonstrated in isolation.
Weaknesses
The primary weakness of the paper from a novelty perspective is the presentation of the core transformational idea as entirely new, when it is, in fact, an extension of prior art, including the authors' own.
-
Incremental Nature of the Core Idea: The fundamental concept of "delay space arithmetic" for neural network operations is not introduced in this paper. It is a direct extension of the authors' previous work in [23] ("Energy Efficient Convolutions with Temporal Arithmetic"). That paper laid the groundwork for mapping convolutions into the temporal domain using the same delay-for-multiplication and nLSE-for-addition transform. This paper generalizes the concept to full ANNs and builds a programmable architecture, but the foundational mapping is not novel to this work. This should be made much clearer to properly frame the contributions.
-
Low Novelty of Non-Linear Operator Extensions: The extension of the framework to support non-linear operations like ReLU and max-pooling (Section 2.1, page 3) is presented as a key part of the contribution. However, these appear to be direct and mathematically straightforward consequences of the -ln(x) transformation. For instance, max(a, b) in the original domain naturally becomes min(-ln(a), -ln(b)) in the log-time domain (see the one-line derivation after this list). This is more of a necessary implementation detail for completeness rather than a significant conceptual advance.
-
Complexity vs. Benefit Justification: The proposed system introduces significant complexity (hybrid-domain operation, specialized analog-like circuits for approximation, critical timing dependencies) compared to a standard digital systolic array. The results in Figure 12 (page 11) show that the design with digital memory (DS DM) offers performance that is largely on par with, and in some cases less energy-efficient than, a conventional digital implementation. The substantial benefits only manifest with the integration of temporal memories (DS TM), an emerging and not-yet-mature technology. This makes the novelty of the architecture contingent on a future technology to justify its complexity, weakening the claim of a clear and present advancement over the state-of-the-art.
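The one-line derivation referenced above: because $x \mapsto -\ln x$ is strictly decreasing on $(0, \infty)$,

$$ -\ln\big(\max(a, b)\big) = \min\!\big(-\ln a,\, -\ln b\big), $$

so a max (or a max-pooling window) in the value domain reduces to selecting the earliest-arriving spike in delay space. This is why the operator extensions read as direct consequences of the transform rather than independent contributions.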
Questions to Address In Rebuttal
-
Please clarify the precise delta between the contributions of this paper and your prior work in [23]. The core arithmetic transformation appears identical. Is the novelty limited to the improved nLSE circuit, the generalization beyond convolutions, and the programmable systolic architecture?
-
Could the authors elaborate on the novelty of implementing ReLU and max-pooling in delay space beyond the fact that they are direct mathematical consequences of the logarithmic transform? Were there non-obvious implementation challenges or trade-offs that constitute a novel contribution?
-
Given that the proposed architecture with digital memory (DS DM) shows limited to no energy-delay-product (EDP) improvement over a standard 8-bit digital array, what is the compelling novelty-driven argument for adopting this significantly more complex temporal computing paradigm without relying on the future maturation and availability of high-precision, low-variability temporal memories?
Phi: Leveraging Pattern-based Hierarchical Sparsity for High-Efficiency Spiking Neural Networks
Abstract
Spiking Neural Networks (SNNs) are gaining attention for their energy efficiency and biological plausibility, utilizing 0-1 activation sparsity through spike-driven computation. While existing SNN accelerators exploit this sparsity to skip zero ...
Reviews
Review 1
Title: Phi: Leveraging Pattern-based Hierarchical Sparsity for High-Efficiency Spiking Neural Networks
Reviewer: The Guardian
Summary
The authors propose "Phi," a framework for accelerating Spiking Neural Networks (SNNs) by exploiting pattern-based hierarchical sparsity. The core idea is to decompose the binary spike activation matrix into two levels: Level 1 (vector-wise sparsity), which represents common activation row-vectors as pre-defined patterns whose results can be pre-computed, and Level 2 (element-wise sparsity), a highly sparse correction matrix to handle deviations from these patterns. The paper presents an algorithm-hardware co-design, including a k-means-based algorithm for pattern selection, a pattern-aware fine-tuning (PAFT) technique to increase pattern matching, and a dedicated hardware accelerator to process both sparsity levels efficiently. The authors claim significant speedup (3.45x) and energy efficiency improvements (4.93x) over the state-of-the-art SNN accelerator, Stellar [42].
Strengths
-
Principled Decomposition: The decomposition of the activation matrix into a structured, pattern-based component (Level 1) and an unstructured correction component (Level 2) is a logical approach. It attempts to handle the majority of computation via efficient table lookups (pre-computed PWPs) while managing outliers with a dedicated sparse engine.
-
Comprehensive Co-design: The work is commendably thorough, addressing the problem from the algorithmic level (pattern selection, fine-tuning) to the architectural level (preprocessor, L1/L2 processors). This end-to-end perspective is a clear strength.
-
Detailed Hardware Implementation: The authors provide a detailed description of their proposed hardware architecture, including the pattern matcher, the compressor/packer for Level 2 sparsity, and the reconfigurable adder tree. The design considerations for handling unstructured sparsity in the L2 processor are particularly well-articulated.
Weaknesses
My primary concerns with this work relate to the validity of the evaluation methodology, the justification for key design choices, and the practical implications of the proposed lossy compression scheme.
-
Fundamentally Flawed SOTA Comparison: The central claim of achieving a 3.45x speedup and 4.93x energy improvement hinges on the comparison with Stellar [42]. However, the authors explicitly state in Section 5.1, "For Stellar, we rely on the results reported in the paper." This is an unacceptable methodological flaw for a top-tier architecture conference. Comparing results from one's own simulator against numbers reported in another paper is not a valid, apples-to-apples comparison. Differences in simulator assumptions, process technology characterization (even when targeting the same node), memory models, and benchmark implementation details can lead to significant discrepancies. Without implementing Stellar within the same evaluation framework, the primary performance claims of this paper are unsubstantiated.
-
Weak Motivation and Over-reliance on Visualization: The entire premise is motivated by the t-SNE visualization in Figure 1, which purports to show that SNN activations are more "clustered" than DNN activations. t-SNE is a visualization technique notorious for creating the illusion of clusters where none may exist. It is not a rigorous method for cluster validation. The paper lacks any quantitative analysis (e.g., silhouette scores, variance analysis) to prove that these clusters are statistically significant and that the k-means approach is well-founded. The motivation rests on a subjective visual interpretation rather than rigorous data analysis (one possible form of the requested check is sketched after this list).
-
Unjustified Accuracy-Performance Trade-off: The Pattern-Aware Fine-Tuning (PAFT) method introduces a non-trivial accuracy degradation. As shown in Figure 11, the accuracy drop for VGG16 on CIFAR-100 is approximately 1.5% (from ~92% for the lossless Phi to ~90.5% for Phi with PAFT). The authors dismiss this as a "minor decrease." This is a subjective judgment. A 1.5% absolute drop can be significant for many applications. This moves the work from a "lossless accelerator" to a "lossy co-design," which should be compared against other lossy techniques like quantization and pruning, not just other SNN accelerators. The paper fails to adequately position and justify this trade-off.
-
Underestimated Hardware and Storage Overheads:
- The preprocessing logic, particularly the pattern matcher and the packer, appears complex. The pattern matcher must compare each activation row against 128 stored patterns (Section 5.2.2). While implemented as a systolic array, this still represents a substantial area and power cost that is not sufficiently analyzed. The benefit analysis in Section 6.1 feels like a post-hoc justification rather than an integral part of the evaluation.
- The framework requires storing q=128 patterns of length k=16 for each layer and partition. For deep and wide networks, this calibration data could become substantial. The paper does not analyze the total storage cost of these patterns or the overhead of loading them for each layer.
-
Limited Scalability of the Calibration Process: The paper states that patterns are selected independently for each "model, dataset, layer, and partition" (Section 3.2). This offline calibration step seems computationally intensive and data-dependent. It raises questions about the framework's adaptability. How does Phi perform on a model trained on one dataset but deployed for inference on a slightly different, out-of-distribution dataset? The tight coupling between the calibrated patterns and the training data distribution may represent a significant practical limitation.
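Returning to the visualization concern above: the quantitative check being requested could be as simple as the scikit-learn sketch below. The activation matrix here is a random stand-in and the codebook sizes are assumptions, so only the methodology, not the scores, is meaningful.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Stand-in for captured binary SNN activation row-vectors of length k=16;
# in a real analysis these would be dumped from the calibration set.
rng = np.random.default_rng(0)
acts = (rng.random((2000, 16)) < 0.2).astype(np.float32)

# Fit k-means codebooks of the kind the paper uses and score the clustering.
for q in (16, 64, 128):
    km = KMeans(n_clusters=q, n_init=10, random_state=0).fit(acts)
    score = silhouette_score(acts, km.labels_, metric="hamming",
                             sample_size=1000, random_state=0)
    print(f"q={q:4d}  silhouette={score:+.3f}")
# Scores near zero would suggest the t-SNE clusters are largely illusory;
# clearly positive scores would support the paper's premise.
```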
Questions to Address In Rebuttal
-
Please provide a compelling justification for comparing your simulated results against the reported results of Stellar [42]. Given the methodological invalidity of this approach, how can the authors stand by their SOTA claims? A fair comparison would require implementing Stellar within your simulation framework.
-
Can you provide a quantitative, statistical analysis of the clustering of SNN activation vectors that goes beyond the subjective t-SNE visualization in Figure 1? This is critical to establishing the foundation of your work.
-
The PAFT fine-tuning results in a lossy scheme. How does the resulting accuracy/performance trade-off compare to established SNN compression techniques like pruning? Why should the community accept a ~1.5% accuracy loss for a 1.26x speedup (as per Section 5.4)?
-
Provide a detailed area and power breakdown of the entire Preprocessor (Matcher, Compressor, Packer, and associated control logic). How does this overhead scale with the number of patterns (q) and the partition size (k)? Show that the preprocessing overhead does not dominate the overall energy savings for layers with low computational intensity.
-
What is the total storage footprint for all calibrated patterns across a model like ResNet-18 or VGG-16? Please clarify how this data is managed and loaded during network execution and account for its overhead.
Review 2
Review Form: The Synthesizer
Summary
This paper introduces Phi, a novel framework for accelerating Spiking Neural Networks (SNNs) by exploiting a higher-order structure in their activations than has been previously considered. The authors' core observation is that the binary spike activations in SNNs are not randomly distributed but form distinct, recurring patterns or clusters (visualized effectively in Figure 1c, page 2).
Building on this insight, they propose a "pattern-based hierarchical sparsity" that decomposes the activation matrix into two levels: 1. Level 1 (Vector Sparsity): A dense matrix of indices pointing to a small, pre-defined codebook of common activation patterns. The computation for these patterns against the weight matrix is pre-calculated offline, converting most runtime computation into memory lookups. 2. Level 2 (Element Sparsity): A highly sparse correction matrix containing {+1, -1} values to account for the differences (outliers) between the actual activations and the matched patterns.
The paper presents a full algorithm-hardware co-design, including a k-means-based algorithm for discovering patterns and a dedicated hardware architecture to efficiently process both levels of sparsity on the fly. The authors report significant improvements in speed (3.45x) and energy efficiency (4.93x) over the state-of-the-art SNN accelerator, Stellar.
Strengths
-
A Fundamental and Elegant Insight: The primary strength of this work lies in its foundational contribution. The SNN accelerator community has largely focused on optimizing for bit sparsity (skipping zero-activations). This paper makes a conceptual leap by identifying and exploiting vector sparsity (skipping computation for entire recurring patterns). This shifts the optimization target from individual bits to meaningful information chunks, which is a powerful and elegant new perspective. The idea feels fundamental and has the potential to become a standard technique in the field.
-
Bridging Concepts Across Domains: This work serves as an excellent bridge between several important areas of research. The core mechanism is conceptually analogous to Vector Quantization (VQ) or dictionary learning, where a codebook of representative vectors is used to compress information. The Level 2 correction matrix is effectively a clever way to handle the quantization error in hardware. More importantly, this work has strong and immediate relevance to the acceleration of low-bit Deep Neural Networks (DNNs). As the authors rightly point out in Section 6.2 (page 12), techniques like bit-slicing decompose multi-bit DNN matrices into binary ones. The Phi framework could therefore be a highly effective mechanism for accelerating these bit-sliced DNNs, giving the work significance far beyond the SNN niche.
-
Comprehensive Co-Design: The authors present a convincing end-to-end solution. They do not merely propose a software algorithm but have clearly thought through the architectural implications. The design of the Preprocessor (Figure 4, page 7) to handle dynamic pattern matching and the separate, specialized L1 and L2 Processors to handle structured lookups and unstructured corrections, respectively, demonstrates a mature and holistic approach to the problem. This makes the proposed performance gains far more credible than a purely algorithmic study would be.
Weaknesses
While the core idea is strong, the paper could be strengthened by further exploring the boundaries and context of the contribution.
-
Limited Exploration of Pattern Genesis: The paper observes the existence of patterns but does not deeply investigate why these patterns emerge. Is this phenomenon inherent to the Leaky-Integrate-and-Fire (LIF) neuron dynamics? Is it a byproduct of certain training methods (e.g., surrogate gradients)? Understanding the origin of this structure would strengthen the theoretical underpinnings and help predict how well the technique might generalize to future SNN models and neuron types.
-
Positioning Could Be Broader: The authors connect their work to bit-slicing in the discussion, but this connection is so powerful that it deserves to be highlighted earlier and more prominently. Framing the work from the outset as a general technique for "pattern-based binary matrix computation" would better capture its potential impact for the wider computer architecture community, which is increasingly focused on extreme quantization in DNNs.
-
Static Nature of the Pattern Codebook: The proposed framework relies on a static, pre-calibrated set of patterns for each layer. While this is a practical starting point, it raises questions about adaptability. For applications with significant domain shift or in continual learning scenarios, a static codebook might become suboptimal. A brief discussion on the potential for dynamic or adaptive pattern updates would add a valuable forward-looking perspective.
Questions to Address In Rebuttal
-
The core observation of clustered activation patterns in Figure 1c is compelling. Can the authors provide some intuition or evidence on how universal this property is? For instance, do these well-defined clusters persist across different SNN architectures (e.g., deep CNNs vs. Transformers like Spikformer) and across different datasets (e.g., temporal event-based data like CIFAR10-DVS vs. static images)?
-
The paper partitions the activation matrix along the K-dimension with a fixed size k=16 (Section 5.2.1, page 10). What is the architectural and algorithmic trade-off here? Would partitioning into 2D blocks or along the activation channel dimension yield a different, perhaps richer, set of patterns? A deeper justification for this 1D partitioning would be helpful.
-
Regarding the connection to quantized DNNs: If one were to apply the Phi framework to a 4-bit weight-and-activation DNN by bit-slicing the activation matrix, how would the proposed approach compare to other state-of-the-art 4-bit accelerators? Specifically, would the overhead of pattern matching and handling two sparsity levels still be advantageous compared to specialized hardware for direct 4-bit multiply-accumulate operations?
Review 3
Paper Title: Phi: Leveraging Pattern-based Hierarchical Sparsity for High-Efficiency Spiking Neural Networks Review Form: The Innovator (Novelty Specialist)
Summary
The authors introduce "Phi," a framework for accelerating Spiking Neural Networks (SNNs) by exploiting a novel form of sparsity they term "pattern-based hierarchical sparsity." The core idea is to decompose the binary spike activation matrix into two components. Level 1 represents rows of the activation matrix (vectors) that closely match a pre-defined dictionary of patterns, enabling the use of pre-computed partial results. Level 2 is a highly sparse correction matrix, containing {+1, -1} values, that accounts for the differences (or "residuals") between the actual activations and the matched patterns. The framework includes an algorithmic component for pattern discovery using a k-means-based approach and a hardware co-design featuring a dedicated architecture to process both levels of sparsity efficiently at runtime. The authors claim significant speedup and energy efficiency improvements over state-of-the-art SNN accelerators.
Strengths
The primary strength of this paper lies in its specific and well-executed application of established compression principles to the unique domain of SNN activations.
-
Novel Application Domain: While the constituent ideas are not entirely new to computer science (as detailed below), their application to the binary, event-driven activation matrices of SNNs is novel. The observation that SNN activations exhibit strong clustering behavior (Figure 1c, page 2) is a key insight, and building a full hardware/software stack around it is a significant contribution.
-
Elegant Residual Representation: The use of a {+1, -1} basis for the Level 2 correction matrix is an elegant and efficient mechanism for representing the residual in a binary space. It naturally handles both types of mismatches (1 in the activation but 0 in the pattern, and vice versa) and is well-suited for hardware implementation.
Comprehensive Co-Design: The work presents a complete co-design, from the algorithmic pattern selection method to a detailed hardware architecture. This demonstrates a thorough understanding of the problem and provides a convincing case for the framework's feasibility.
Weaknesses
The paper's primary weakness, from a novelty standpoint, is its failure to adequately contextualize its core mechanism within the broader history of data compression and quantization. The authors present the concept of pattern-based, hierarchical decomposition as a fundamentally new idea, which it is not.
-
Conceptual Overlap with Vector Quantization (VQ): The core idea of Level 1 is functionally identical to Vector Quantization, a concept that dates back decades. In VQ, a "codebook" of representative vectors is created (analogous to Phi's "pre-defined patterns"), and input vectors are replaced by the index of the closest codebook entry. The k-means algorithm, which the authors use for "pattern selection" (Section 3.2, page 5), is the standard algorithm for generating VQ codebooks. The paper does not mention VQ, which is a significant omission of prior art.
-
Conceptual Overlap with Residual/Hierarchical Compression: The two-level Phi sparsity is a form of residual or multi-stage compression. The Level 1 pattern provides a coarse approximation of the activation vector, and the Level 2 matrix provides a fine-grained residual correction. This concept is the foundation of techniques like Residual Vector Quantization (RVQ) and other hierarchical decomposition methods used widely in signal processing and data compression. The paper presents this hierarchy as a novel invention rather than a novel application of a well-known principle.
-
Insufficient Discussion of Prior Art in the DNN Space: While the paper does compare itself to SNN accelerators, its discussion in Section 6.2 ("Relationship with Sparsity and Quantization in DNNs," page 12) misses the most relevant conceptual predecessors in the conventional DNN space. It compares Phi to zero-skipping and bit-slicing but fails to discuss works that use VQ or other dictionary-based methods on weights or activations in traditional DNNs. Acknowledging and differentiating from these works is critical for properly situating the paper's contribution.
The novelty here is not the invention of a pattern+residual scheme, but its specific adaptation and hardware implementation for the unique constraints and opportunities of binary SNN activations. The paper would be substantially stronger if it framed its contribution as such, rather than implying the invention of the core concept itself.
Questions to Address In Rebuttal
-
Could the authors please clarify the novelty of their pattern-based approach in relation to classical Vector Quantization (VQ) and Residual Vector Quantization (RVQ)? How does Phi's Level 1 (pattern matching) and Level 2 (correction matrix) conceptually differ from a one-stage RVQ where the codebook is derived from k-means clustering?
-
The offline calibration step creates a static dictionary of patterns based on a training subset. This seems vulnerable to distribution shift between the calibration set and unseen inference data. Have the authors analyzed the robustness of their selected patterns? How much does performance degrade if the activation patterns at inference time differ significantly from those seen during calibration?
-
The use of pre-computed Pattern-Weight Products (PWPs) trades computation for memory traffic. The paper notes this requires a PWP prefetcher to manage the "heavy memory traffic induced by PWPs" (Section 5, page 9). Could the authors provide a more detailed analysis of this trade-off? Specifically, for very large models or layers, could the storage and bandwidth requirements for PWPs become a new bottleneck that negates the computational savings?
Bishop: Sparsified Bundling Spiking Transformers on Heterogeneous Cores with Error-constrained Pruning
Abstract
Spiking neural networks (SNNs) have emerged as a promising solution for deployment on resource-constrained edge devices and neuromorphic hardware due to their low power consumption. Spiking transformers, which integrate attention mechanisms similar to ...
Reviews
Review 1
Paper Title: Bishop: Sparsified Bundling Spiking Transformers on Heterogeneous Cores with Error-constrained Pruning Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present Bishop, a comprehensive HW/SW co-design framework for accelerating spiking transformers. The proposal includes a new data container called Token-Time Bundle (TTB) to manage spatiotemporal workloads, two algorithmic optimizations—Bundle Sparsity-Aware (BSA) training and Error-Constrained TTB Pruning (ECP)—and a heterogeneous hardware architecture composed of dense, sparse, and dedicated attention cores. While the ambition to create a full-stack solution for this emerging model class is noted, the work is undermined by several critical flaws, most notably an inappropriate choice of baselines that likely inflates performance claims, unsubstantiated assertions regarding the "error-constrained" nature of their pruning method, and a lack of evidence for the system's robustness beyond a narrow set of highly-tuned configurations.
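For orientation, the following minimal sketch shows one plausible reading of how spikes are grouped into Token-Time Bundles and stratified between the dense and sparse cores. The bundle sizes (BSt, BSn) and the threshold θs follow the paper's notation, but the routing rule itself is this reviewer's simplification for illustration, not the authors' design.

```python
import numpy as np

def stratify_ttbs(spikes, bs_t=4, bs_n=8, theta_s=0.5):
    """Illustrative Token-Time Bundle (TTB) stratification: spikes is a binary
    tensor of shape (T, N, C) = (time steps, tokens, channels). Bundles of
    bs_t time steps x bs_n tokens are routed to the dense core when their
    spike density exceeds theta_s, to the sparse core when they are non-empty,
    and skipped entirely when empty."""
    T, N, C = spikes.shape
    routes = {}
    for t0 in range(0, T, bs_t):
        for n0 in range(0, N, bs_n):
            bundle = spikes[t0:t0 + bs_t, n0:n0 + bs_n, :]
            density = bundle.mean()
            if density == 0.0:
                routes[(t0, n0)] = "skip"        # coarse-grained computation skipping
            elif density > theta_s:
                routes[(t0, n0)] = "dense_core"
            else:
                routes[(t0, n0)] = "sparse_core"
    return routes

rng = np.random.default_rng(0)
spikes = (rng.random((8, 32, 64)) < 0.15).astype(np.uint8)
routes = stratify_ttbs(spikes, bs_t=4, bs_n=8, theta_s=0.5)
print({r: list(routes.values()).count(r) for r in ("skip", "sparse_core", "dense_core")})
```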
Strengths
- Ambitious Scope: The paper commendably attempts to address the acceleration of spiking transformers from the algorithm level down to the microarchitecture, which is the correct approach for a co-design framework.
- Structured Workload Management: The central concept of the Token-Time Bundle (TTB) provides a structured primitive for managing sparse spatiotemporal workloads. This approach logically facilitates data reuse and enables coarse-grained computation skipping, which is a sound principle for accelerator design.
- Problem Motivation: The workload analysis presented in Section 2.2 and Figure 3 correctly identifies that MLP and projection layers, not just the attention mechanism, constitute a significant computational bottleneck in spiking transformers, providing a solid motivation for the overall architectural design.
Weaknesses
This paper suffers from significant methodological and analytical weaknesses that call its central claims into question.
-
Fundamentally Flawed Baseline Comparisons: The claimed speedup and energy efficiency improvements (5.91x and 6.11x) are built on a foundation of inappropriate and poorly-defined baselines.
- PTB [26] is a Spiking CNN Accelerator: The primary hardware baseline, PTB, was designed for the regular, structured dataflow of convolutional layers. Spiking transformers are dominated by matrix-matrix multiplications (in MLP/projection) and the highly irregular, data-dependent communication patterns of self-attention. Comparing a specialized transformer accelerator to a CNN accelerator is an apples-to-oranges comparison. The data movement, memory access patterns, and compute granularities are fundamentally different. Any reported speedup is therefore suspect, as the baseline is not architecturally suited for the target workload. A valid comparison would require adapting a state-of-the-art ANN transformer accelerator to the accumulate-only nature of SNNs or using a more flexible, state-of-the-art SNN accelerator capable of handling FC-like layers efficiently.
- Weak and Opaque GPU Baseline: The "Edge GPU" baseline is an NVIDIA Jetson Nano, a low-power device from 2019. This is not a competitive baseline for demonstrating state-of-the-art performance. Furthermore, the authors provide no details on the software implementation. Was it a naive PyTorch implementation, or were optimized kernels (e.g., using CUTLASS, cuSPARSE) or TensorRT employed? Without these details, the GPU comparison is unverifiable and likely represents a lower bound on achievable performance.
-
Unsubstantiated "Error-Constrained" Pruning Claim: The abstract and introduction prominently feature "Error-Constrained TTB Pruning (ECP)" with a "well-defined error bound." However, the paper completely fails to substantiate this claim.
- Section 5.1 introduces a pruning threshold θp but provides no mathematical formulation linking this threshold to any analytical error bound on the output of the attention layer or the model's final accuracy. The term "error-constrained" implies a formal guarantee or control mechanism, which is absent.
- The methodology appears to be simple threshold-based magnitude pruning, where the threshold is empirically swept to find a value that doesn't excessively degrade accuracy (as shown in Figure 14). This is empirically-tuned pruning, not error-constrained pruning. The claim is a misrepresentation of the method.
- Pruning the values (V) is particularly dangerous as it directly removes information from the feature representation. The authors provide no analysis of how ECP avoids catastrophic information loss, relying only on a qualitative image (Figure 8) as evidence.
-
Lack of Robustness and High Sensitivity to Hyperparameters: The proposed system introduces numerous hyperparameters (TTB size (BSt, BSn), stratification threshold θs, pruning threshold θp), and the paper's own analysis suggests the system is brittle.
- Figure 15 shows that the Energy-Delay Product (EDP) is highly sensitive to the stratification threshold θs. A deviation of the threshold to 20% or 80% results in a significant performance degradation. This indicates that the heterogeneous architecture requires careful, layer-wise tuning and may not be robust to workload variations.
- Similarly, Figure 16 demonstrates a very narrow "sweet spot" for the TTB volume. This suggests the system is over-fitted to the specific model architectures and datasets tested and undermines claims of general applicability. A truly robust system would exhibit more graceful performance degradation outside the optimal parameter range.
-
Insufficient Architectural Justification and Overhead Analysis:
- The paper does not provide an ablation study to justify its key architectural decision: heterogeneity. Would a larger, homogeneous core (either sparse or dense) with the same total area/power budget perform better or worse? The necessity of the complex three-core (dense, sparse, attention) design plus a stratifier is assumed, not proven.
- The overheads associated with the TTB framework are ignored. Managing bundles requires metadata, indexing logic, and packing/unpacking operations. The area, power, and latency costs of this TTB management logic and the stratifier are not detailed in the breakdown in Figure 17, which is a critical omission for a hardware paper.
Questions to Address In Rebuttal
The authors must provide clear and convincing answers to the following questions:
- Baselines: Please provide a detailed justification for using a Spiking CNN accelerator (PTB [26]) as the primary baseline for a Spiking Transformer accelerator. To make your claims credible, please provide a comparison against a more architecturally relevant baseline (e.g., an SNN-adapted ANN Transformer accelerator). For the GPU baseline, please specify the exact software stack and optimization level used and justify why the Jetson Nano is a representative platform.
- Error-Constrained Pruning: Provide the precise mathematical definition of the "well-defined error bound" for ECP that you claim exists. How is this error bound analytically linked to the pruning threshold θp? If no such analytical link exists, please retract the "error-constrained" claim and re-frame it as an empirical technique.
- Robustness: Your performance results appear highly sensitive to the choice of stratification threshold and TTB volume. How would the optimal parameters determined for one model (e.g., Model 3 on ImageNet-100) perform on another (e.g., Model 5 on Google SC) without re-tuning? Please provide evidence to support the generalizability of your approach.
- Architectural Overheads: Please provide an ablation study that justifies the necessity of your heterogeneous core design over a simpler, homogeneous architecture of equivalent area. Furthermore, provide a quantitative breakdown of the overheads (area, power, and latency) incurred by the TTB management logic and the workload stratifier.
Review 2
Paper Title: Bishop: Sparsified Bundling Spiking Transformers on Heterogeneous Cores with Error-constrained Pruning Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents Bishop, a comprehensive HW/SW co-design framework for accelerating spiking transformers. The authors correctly identify that this emerging class of models, while promising, cannot be efficiently executed on existing SNN or ANN accelerators due to their unique spatiotemporal workload characteristics.
The core contribution is the introduction of the Token-Time Bundle (TTB), a data abstraction that groups spikes across both tokens and time steps. This single, powerful idea serves as the foundation for the entire system. Built around the TTB, the authors propose a holistic solution: 1. A heterogeneous accelerator architecture featuring a stratifier to route TTBs to specialized dense and sparse processing cores, maximizing efficiency based on activation density. 2. A novel Bundle Sparsity-Aware (BSA) training algorithm that encourages structured, TTB-level sparsity, making the workload more amenable to acceleration. 3. An Error-Constrained TTB Pruning (ECP) technique to reduce computation in the costly self-attention mechanism by selectively trimming query, key, and value bundles. 4. A dedicated, reconfigurable attention core that leverages the binary nature of spikes to perform attention calculations using simple AND/Accumulate operations.
The authors demonstrate that this co-designed approach yields significant improvements in speedup (5.91x) and energy efficiency (6.11x) over prior SNN accelerators.
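The claim in component 4 above, that spiking attention reduces to AND/Accumulate operations, is easy to verify in isolation with the short sketch below; the shapes and spike rates are arbitrary, and the point is only that AND-plus-popcount reproduces the multiplier-based Q @ K^T exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 16, 32                                    # tokens, head dimension
Q = (rng.random((N, d)) < 0.2).astype(np.uint8)  # binary spiking queries
K = (rng.random((N, d)) < 0.2).astype(np.uint8)  # binary spiking keys

# Multiplier-based view: S = Q @ K^T.
S_mul = Q.astype(np.int32) @ K.T.astype(np.int32)

# AND/Accumulate view: each score is the popcount of the bitwise AND of a
# query row and a key row, i.e. no multiplications are needed.
S_and = np.array([[int(np.sum(q & k)) for k in K] for q in Q], dtype=np.int32)

assert np.array_equal(S_mul, S_and)
```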
Strengths
-
A Foundational Abstraction for a New Problem Domain: The paper’s greatest strength is its identification of a new, important problem—the acceleration of spiking transformers—and its proposal of a clear, foundational solution. The Token-Time Bundle (TTB) concept (Section 3.2, page 5) is an elegant way to manage the complex spatiotemporal sparsity of these models. It transforms an unstructured, fine-grained problem into a structured, coarse-grained one, which is vastly more amenable to hardware optimization. This abstraction could very well become a standard way of reasoning about and processing these workloads in the future.
-
Exemplary HW/SW Co-design: This work is a textbook example of a holistic, full-stack approach. Rather than designing hardware for a fixed algorithm, the authors modify the algorithm itself to suit the hardware. The BSA training pipeline (Section 4.1, page 5) actively creates the structured sparsity that the heterogeneous cores are designed to exploit. Similarly, the ECP technique (Section 5, page 6) is a model-level optimization that directly maps to reduced hardware activity in their custom attention core. This synergy between algorithm and architecture is what leads to the impressive results and is a model for future research in the domain.
-
Novel and Well-Motivated Architectural Decisions: The architecture is not a monolithic design but a thoughtful composition of specialized units. The use of a "stratifier" to dispatch workloads to either a dense or sparse core (Figure 9, page 8) is a direct and intelligent response to the varying activation densities found in spiking workloads. This is a significant step beyond homogeneous SNN accelerator designs. Furthermore, the design of the TTB spiking attention core, which replaces expensive multiplications with bitwise operations, correctly identifies and tackles the primary computational bottleneck in transformers.
Weaknesses
While this is a strong and well-executed paper, its context within the broader landscape of AI acceleration could be strengthened.
-
Missing Contextual Comparison to ANN Transformer Accelerators: The paper's primary baseline is a prior SNN accelerator (PTB) [26]. While this is a necessary and fair comparison within the neuromorphic field, it leaves a critical question unanswered. Spiking transformers are ultimately competing with conventional ANN transformers on performance and efficiency. A crucial piece of context would be to compare Bishop's end-to-end efficiency (e.g., energy-per-inference for a given accuracy) against a state-of-the-art sparse ANN transformer accelerator (running an appropriately quantized and pruned ANN-ViT). Without this, it is difficult to assess whether the entire SNN-based approach, even when highly optimized, provides a true efficiency advantage over the incumbent ANN paradigm.
-
Scalability and Overhead of the TTB Abstraction: The TTB is a powerful concept, but its practical implementation involves overheads (e.g., metadata for active/inactive bundles, routing logic in the stratifier). The paper evaluates models of moderate scale. A discussion on how this management overhead scales to much larger transformer models (e.g., with thousands of tokens or hundreds of layers) would be valuable. Does a point exist where managing the bundles becomes a bottleneck itself, or where the "sweet spot" for bundle size (Figure 16, page 12) changes dramatically?
-
Clarity on the "Error-Constrained" Pruning: The abstract promises a "well-defined error bound" for the ECP technique. However, the description in Section 5 (page 6) presents it as a process of trimming bundles based on user-specified thresholds (θp, θq). It is not immediately clear how these thresholds translate to a formal, predictable error bound on the model's output. Is this an empirically derived relationship, or is there a theoretical grounding? Clarifying this would strengthen the claims of the ECP method.
Questions to Address In Rebuttal
-
Could the authors provide an estimate, even if it's a "back-of-the-envelope" calculation, of how Bishop's energy-delay product (EDP) for a task like CIFAR100 would compare to a leading sparse ANN transformer accelerator executing a similarly performing, quantized MobileViT or DeiT model? This would help position the work in the broader context of efficient AI.
-
Regarding the ECP mechanism, can you elaborate on the process of determining the pruning thresholds? How is the "error bound" established and maintained? Is it a hard constraint, or a target that is achieved via iterative training and tuning?
-
The TTB concept seems broadly applicable to spatiotemporal workloads. Have the authors considered its applicability beyond spiking transformers, perhaps to other event-based models like spiking LSTMs or models for dynamic vision sensor processing? A brief comment on the generality of the core idea would enhance the paper's impact.
Recommendation: Accept. This is a pioneering work that provides the first comprehensive acceleration framework for an important and emerging class of models. The core TTB abstraction is novel and powerful, and the holistic HW/SW co-design is executed exceptionally well. This paper will likely be highly influential in the fields of neuromorphic engineering and specialized computer architecture.
Review 3
Paper Title: Bishop: Sparsified Bundling Spiking Transformers on Heterogeneous Cores with Error-constrained Pruning Reviewer Persona: The Innovator (Novelty Specialist)
Summary
The authors present Bishop, a hardware/software co-design framework for accelerating spiking transformers. The core proposal consists of several interconnected components: (1) a data container called Token-Time Bundle (TTB) to group spatiotemporal workloads for data reuse; (2) a heterogeneous accelerator architecture with a "stratifier" to route TTBs to either a dense or a sparse processing core; (3) a Bundle Sparsity-Aware (BSA) training algorithm to induce structured sparsity at the TTB level; (4) an Error-Constrained TTB Pruning (ECP) technique to prune low-activity queries and keys in the attention mechanism; and (5) a dedicated spiking attention core that uses simplified AND-Accumulate operations. The authors claim this is the first dedicated accelerator framework for spiking transformers and demonstrate significant speedup and energy efficiency gains over prior SNN accelerators.
Strengths
The primary strength of this work lies in the tight integration and synthesis of its components. While individual concepts may have precedents, the authors have constructed a cohesive end-to-end system where the software optimizations (BSA, ECP) are explicitly designed to create data structures (sparse TTBs) that the hardware architecture (heterogeneous cores) is specifically built to exploit. This holistic co-design approach for the niche but growing domain of spiking transformers is commendable.
Weaknesses
My evaluation is focused exclusively on the novelty of the core ideas presented. While the system as a whole is new, a deconstruction of its constituent parts reveals that many of the foundational concepts are evolutionary extensions of prior art rather than revolutionary inventions.
-
The "Token-Time Bundle" (TTB) is conceptually similar to prior work. The idea of batching spikes over time to improve data reuse is not new. Jeong et al. [26] proposed "Parallel Time Batching" (PTB) for spiking CNNs, which this paper cites. The TTB (Section 3.2, page 948) extends this concept by adding a token dimension (
BSn) to the time dimension (BSt). While this is a logical and necessary adaptation for transformer architectures, it represents an incremental step—a dimensional extension of a known technique—rather than a fundamentally new data packing paradigm. -
Heterogeneous dense/sparse architectures are a well-established design pattern. The use of separate processing units for dense and sparse computations, managed by a routing or stratification unit (Section 5.2, page 950), is a known technique for optimizing workloads with varying sparsity. This principle has been explored in general-purpose architectures (e.g., NVIDIA's Ampere) and in prior SNN accelerators that aim to skip inactive neuron computations. The novelty here is not the heterogeneous architecture itself, but its application to workloads structured as TTBs. The contribution is in the integration, not the architectural concept.
-
Bundle Sparsity-Aware (BSA) Training applies a known principle to a new structure. Sparsity-aware training, particularly structured pruning where groups of parameters or activations are zeroed out, is a vast field of research. The BSA algorithm (Section 4.1, page 948) introduces a regularization term to encourage entire TTBs to become empty. This is a clever application of structured pruning, but the underlying mechanism—adding a group sparsity regularizer to the loss function—is a standard technique. The novelty is the choice of the target structure (the TTB), which is a direct consequence of the hardware design, not a fundamental advance in training algorithms.
-
The spiking attention core's "AND-Accumulate" is an implementation detail, not an algorithmic novelty. The paper highlights a reconfigurable core that uses "AND" and "Accumulate" operations (Section 5.5, page 952). However, the simplification of matrix multiplication to accumulation is an inherent property of event-based processing in SNNs. When a spiking query vector is multiplied by a spiking key matrix, the computation naturally reduces to accumulating the key rows corresponding to the spike locations in the query vector. The "AND" operation is simply a hardware realization of identifying these co-located spikes. The contribution is the design of a dedicated hardware unit that performs this known computation efficiently on TTB-formatted data, not the invention of the simplification itself.
-
The claim of being the "first dedicated hardware accelerator" needs careful qualification. The authors state in the abstract and introduction that Bishop is the "first dedicated hardware accelerator architecture... for spiking transformers." However, they also cite Qi et al. [49], which they describe as a "simple spiking transformer architecture." While Bishop is undoubtedly more sophisticated and presents a full co-design framework, the existence of [49] challenges the unqualified claim of being "first." A more accurate claim would be the "first HW/SW co-design framework" or the "first sparsity-aware heterogeneous accelerator" for this domain. The lack of a quantitative comparison against [49] is a notable omission.
Questions to Address In Rebuttal
-
The Token-Time Bundle (TTB) is presented as a key contribution. Could the authors please elaborate on the fundamental novelty of TTB beyond being a two-dimensional extension (token and time) of the one-dimensional Parallel Time Batching (PTB) concept from prior art [26]?
-
The paper claims to be the "first dedicated hardware accelerator... for spiking transformers." Could the authors please provide a more detailed differentiation from the work of Qi et al. [49]? A qualitative discussion on why Bishop's approach is fundamentally different and a justification for the absence of a quantitative benchmark comparison would strengthen the paper's positioning.
-
Regarding the BSA training algorithm (Section 4.1, page 948), the core idea is to apply a group sparsity regularizer. How does this technique fundamentally differ from established methods for structured pruning, other than the fact that the target "group" is the author-defined TTB?
-
The complexity vs. benefit trade-off for the heterogeneous core design is unclear. Introducing a stratifier and two separate core types adds significant design complexity. Given that this architectural pattern is known, what is the "delta" in performance gain that justifies this complexity specifically for spiking transformers, compared to a more unified, flexible sparse architecture?
RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
Abstract
Retrieval-augmented generation (RAG) is emerging as a popular approach for reliable LLM serving. However, efficient RAG serving remains an open challenge due to the rapid emergence of many RAG variants and the substantial differences in workload ...
Reviews
Review 1
Paper Title: RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving Reviewer: The Guardian
Summary
The authors present RAGO, a framework for optimizing the performance of Retrieval-Augmented Generation (RAG) serving systems. The paper first introduces RAGSchema, a structured abstraction to describe the diverse landscape of RAG pipelines. Using this abstraction, the authors characterize the performance of four representative RAG paradigms, identifying shifting bottlenecks between retrieval and inference components under various configurations. They then propose the RAGO framework, which performs an exhaustive search over a design space of task placement, resource allocation, and batching policies to find Pareto-optimal system configurations. The evaluation, conducted using a custom simulation framework, claims that RAGO can achieve up to a 2x increase in QPS/chip and a 55% reduction in TTFT compared to a baseline extended from LLM-only serving systems.
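To make the scale of the search concrete, the exhaustive-search-plus-Pareto-filtering structure attributed to Algorithm 1 has roughly the shape sketched below. The stage list, discretization, and cost model are placeholders invented for illustration (not the paper's calibrated simulator), so only the combinatorial structure, and how quickly it grows, is representative.

```python
from itertools import product

# Toy stand-in for the search space: chips and batch size per pipeline stage.
STAGES = ("retrieval", "prefix", "decode")
CHIP_CHOICES = (1, 2, 4, 8)
BATCH_CHOICES = (1, 4, 16)
TOTAL_CHIPS = 16

def evaluate(alloc, batch):
    # Placeholder cost model: latency shrinks with chips and grows with batch;
    # throughput is limited by the slowest stage, normalized per chip.
    ttft = sum(100.0 * batch[s] ** 0.5 / alloc[s] for s in STAGES)
    qps_per_chip = min(alloc[s] * batch[s] for s in STAGES) / TOTAL_CHIPS
    return ttft, qps_per_chip

def pareto(points):
    # Keep configurations not strictly dominated in (lower TTFT, higher QPS/chip).
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] >= p[1] and (q[0] < p[0] or q[1] > p[1])
                       for q in points)]

candidates = []
for chips in product(CHIP_CHOICES, repeat=len(STAGES)):
    if sum(chips) > TOTAL_CHIPS:
        continue
    for batches in product(BATCH_CHOICES, repeat=len(STAGES)):
        alloc = dict(zip(STAGES, chips))
        batch = dict(zip(STAGES, batches))
        ttft, qps = evaluate(alloc, batch)
        candidates.append((ttft, qps, alloc, batch))

frontier = pareto(candidates)
print(f"{len(candidates)} configurations evaluated, {len(frontier)} on the Pareto frontier")
```

Even this toy version enumerates every combination of per-stage choices, which is the scalability question raised in the weaknesses below.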
Strengths
- Problem Formulation: The paper correctly identifies a timely and critical problem. As RAG systems move into production, the complexity of their serving pipelines—composed of heterogeneous computational stages—presents a significant optimization challenge. The work does a commendable job of structuring this problem.
- Workload Abstraction (RAGSchema): The proposed RAGSchema (Section 3.2, page 4) is a logical and useful abstraction. It provides a structured vocabulary for defining and comparing complex RAG pipelines, which is a necessary first step for any systematic analysis.
- Performance Characterization: The analysis in Section 5 offers valuable insights into RAG system behavior. The identification of shifting bottlenecks (e.g., retrieval dominating in hyperscale scenarios for small models, Section 5.1, vs. encoding dominating in long-context scenarios, Section 5.2) is well-articulated and highlights the core challenge that the paper aims to solve. The sensitivity analyses regarding model size, query numbers, and sequence lengths are particularly informative.
Weaknesses
My primary concerns with this paper center on the methodological rigor of the evaluation and the novelty of the proposed optimization technique. The claims of optimality and significant performance gains rest on a foundation that appears non-verifiable and potentially fragile.
-
Reliance on a Non-Validated, In-House Simulator: The paper's entire quantitative analysis is predicated on an "in-house calibrated XPU simulator" (Section 4, page 6). The authors state it is "well-correlated with the production-grade XPU accelerators" but provide absolutely no evidence to substantiate this claim. There are no correlation plots, error analyses, or quantitative comparisons to real hardware measurements. Similarly, the retrieval performance model is "calibrate[d]... using internal production datasets." This lack of transparency and validation makes it impossible to assess the credibility of the results. The findings could be artifacts of the simulator's specific assumptions rather than reflections of real-world system behavior. Without rigorous validation, the results are fundamentally irreproducible and untrustworthy.
-
"Optimization" via Brute-Force Search: The core of the RAGO framework is an "exhaustive search" (Algorithm 1, page 11). While functional, this is the most basic possible approach to exploring a design space. To present brute force as a novel optimization framework is a significant overstatement. The paper lacks any discussion on the scalability of this search. What is the size of the configuration space for the evaluated workloads? How long does RAGO take to find the "optimal" schedule? A framework that requires hours or days of search to configure a system is impractical. The contribution here appears to be the enumeration of a search space, not a sophisticated method for navigating it.
-
Potentially Weak Baseline: The claimed 2x performance improvement is measured against a baseline described in Section 7.1 (page 11). This baseline collocates RAG components with the LLM's prefix stage and uses a "carefully tune[d]" 1:1 prefix:decode resource ratio. While not a naive strawman, it is questionable whether this represents a strong, state-of-the-art deployment. A skilled systems engineer would likely already consider disaggregating a component known to be a bottleneck (like the long-context encoder in Case II). The significant gains reported may be partially inflated by comparing against a configuration with obvious, well-understood inefficiencies. The paper does not sufficiently justify that this baseline represents a legitimate, production-quality manual optimization.
-
Oversimplification of System Dynamics: The performance models, based on roofline principles (Figure 4, page 6), inherently simplify complex system interactions. For instance, the analysis of iterative retrieval stalls in Section 5.3 (page 9) explicitly assumes zero latency for retrieval and prefix stages to isolate the batching effect. This is an unrealistic condition that likely magnifies the observed slowdown. The model does not appear to account for system-level effects such as network contention, OS scheduler jitter, or nuanced cache interactions, all of which can significantly impact end-to-end performance and invalidate the clean separation of stages assumed by the simulator.
Questions to Address In Rebuttal
- Simulator Validation: Can the authors provide quantitative evidence validating their XPU and retrieval simulators against real hardware? This should include correlation plots and an analysis of prediction error (e.g., MAPE) across a set of representative model and retrieval configurations.
- Scalability and Practicality of RAGO: What is the runtime of the RAGO exhaustive search for the case studies presented in the paper? How does the search time scale with the number of pipeline stages and the granularity of resource allocation options? At what point does this brute-force approach become intractable?
- Baseline Justification: Please provide a stronger justification for why the chosen baseline represents a state-of-the-art, manually optimized system. Why is this specific collocation strategy and 1:1 resource split the correct point of comparison, as opposed to other plausible heuristic-based configurations?
- Generality of Architectural Conclusions: The analysis is based on a specific family of "XPU" accelerators with parameters detailed in Table 2 (page 5). How would the key findings—particularly the bottleneck locations for each paradigm—change if run on an accelerator with a different compute-to-memory-bandwidth ratio or a different interconnect topology (e.g., an NVIDIA H100 GPU)?
- Impact of Placement Heuristics: In Section 6.1 (page 10), you state that RAGO restricts collocation to "consecutive neighbors to avoid excessively complicating the search space." How do you know that an optimal configuration does not involve collocating non-neighboring stages? What is the potential performance loss introduced by this simplifying heuristic? This constraint seems to contradict the claim of finding a truly optimal schedule.
Review 2
Review Form: The Synthesizer
Summary
This paper presents RAGO, a systematic framework for understanding, characterizing, and optimizing the serving performance of Retrieval-Augmented Generation (RAG) systems. The authors' core contribution is a three-pronged approach to tackle the immense complexity of modern RAG pipelines. First, they introduce RAGSchema, a structured abstraction to formally describe the diverse landscape of RAG workloads. Second, using this schema, they conduct a comprehensive performance characterization of four representative RAG paradigms, revealing non-obvious performance bottlenecks that shift dramatically depending on the workload. Finally, they build RAGO, a system optimization framework that uses these insights to explore a vast scheduling policy space—spanning task placement, resource allocation, and batching—to find Pareto-optimal configurations. The authors demonstrate that RAGO can achieve up to a 2x increase in queries-per-second (QPS) per chip and a 55% reduction in time-to-first-token (TTFT) latency compared to a strong baseline extended from LLM-only serving systems.
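As a rough illustration of what such a structured abstraction might encode, the dataclass below paraphrases the kinds of attributes the summary ascribes to RAGSchema (optional pipeline stages, model sizes, retrieval parameters, sequence shape). The field names and every numeric value are hypothetical placeholders chosen by this reviewer, not the authors' schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RAGSchemaSketch:
    """Illustrative (not the authors') encoding of the performance-relevant
    attributes a RAGSchema-style description would capture for one pipeline."""
    # Retrieval configuration
    database_vectors: int = 64_000_000_000      # placeholder corpus size
    queries_per_retrieval: int = 1
    retrievals_per_request: int = 1             # >1 models iterative retrieval
    # Optional pipeline stages (None means the stage is absent)
    encoder_params: Optional[int] = 120_000_000
    rewriter_params: Optional[int] = None
    reranker_params: Optional[int] = None
    # Main generative LLM and sequence shape
    generator_params: int = 70_000_000_000
    prompt_tokens: int = 512
    output_tokens: int = 256

# Two paradigm-like instantiations; all values are placeholders.
hyperscale = RAGSchemaSketch()
long_context = RAGSchemaSketch(database_vectors=10_000, prompt_tokens=100_000,
                               encoder_params=8_000_000_000)
```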
Strengths
The most significant strength of this work is its conceptual contribution of bringing a principled, systematic methodology to the chaotic and rapidly evolving domain of RAG system optimization. This paper elevates the conversation from ad-hoc tuning of individual components to a holistic, co-design problem.
-
The RAGSchema Abstraction: The introduction of RAGSchema (Section 3.2, page 4) is a standout contribution. In a field where new RAG variants emerge constantly, this abstraction provides a much-needed canonical language for describing and comparing workloads. By capturing key performance-relevant attributes (pipeline stages, model sizes, retrieval parameters), it creates a foundation for reproducible research, benchmarking, and systematic optimization that was previously lacking. It effectively tames the complexity of the problem space.
Insightful Workload Characterization: The performance characterization in Section 5 is excellent and provides immense value to the community on its own. By analyzing four distinct paradigms (hyperscale, long-context, iterative, and rewriter/reranker), the authors demonstrate that there is no "one size fits all" solution. The findings—such as the database encoder becoming the bottleneck in long-context RAG (Section 5.2, page 8) or the subtle idleness effects of batched iterative retrievals (Section 5.3, page 9)—are non-obvious and critical for practitioners and system designers. This analysis effectively maps the problem terrain that RAGO is designed to navigate.
-
Holistic Optimization Space: RAGO addresses the full, coupled optimization problem. It doesn't just tune batch sizes; it considers the interplay between task placement (collocation vs. disaggregation), resource allocation across heterogeneous components (CPU servers and ML accelerators), and batching policies. This holistic view is crucial, as the paper shows that decisions in one dimension profoundly impact the others. This connects disparate research threads from LLM serving (prefix/decode splitting) and distributed systems into a unified framework for RAG.
-
Contextualization and Future-Looking Implications: This work provides a clear bridge between the worlds of ML model design, information retrieval, and computer systems/architecture. The finding that retrieval becomes a dominant bottleneck as ML accelerators improve (Figure 7a, page 8) offers a concrete directive for future hardware design, making a strong case for co-designing retrieval and inference accelerators (as explored in works like Chameleon [50]). RAGO provides the analytical framework needed to reason about such future systems.
Weaknesses
The weaknesses of the paper are less about flaws in the execution and more about the boundaries of its scope and potential areas for future expansion.
-
Abstraction of Quality: RAGSchema and RAGO primarily operate on system performance metrics (latency, throughput). While the authors acknowledge that RAG parameters (e.g., number of retrieved documents, percentage of database scanned) affect model quality (recall), this critical quality-performance trade-off is outside the core optimization loop. In a real-world deployment, a user might specify a target recall, which would constrain the search space. Integrating this dimension would make the framework even more powerful.
Scalability of the Optimization Search: RAGO relies on an exhaustive search to find the Pareto frontier (Section 6.2, page 11). While feasible for the paradigms explored, this approach may face scalability challenges as RAG systems evolve into more complex, dynamic, and conditional agentic workflows. A discussion on how this framework might incorporate heuristic or learning-based search strategies for more complex future workloads would be valuable.
-
Static Pipeline Assumption: The framework appears to assume a relatively static RAG pipeline defined at the outset by
RAGSchema. Agentic systems may involve dynamic, data-dependent execution paths (e.g., deciding to call a tool or perform another retrieval based on the content of a generated token). The current framework doesn't seem to explicitly model this dynamism, which represents the next frontier of complexity.
Questions to Address In Rebuttal
-
The decoupling of performance and quality is a pragmatic choice. Could the authors elaborate on how they envision
RAGO being used in a production setting where a product owner might provide a quality constraint, such as a minimum retrieval recall? Would this simply prune the search space, or would it require a more fundamental change to the optimization objective?
Your baseline system is a thoughtfully tuned extension of an LLM-only system, which is a strong point of comparison. Could you comment on why the hybrid task placement strategies explored by RAGO (Figure 17b, page 12) are so effective compared to a more naive "collocate everything with the prefix" strategy that one might intuitively adopt?
-
Given the exhaustive search methodology, could you speculate on the computational cost of running RAGO itself? How long does it take to generate a Pareto frontier for one of the case studies, and how do you see this scaling as RAG pipelines incorporate more optional stages and components?
Review 3
Paper Title: RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving Reviewer Persona: The Innovator (Novelty Specialist)
Summary
The authors present a framework for optimizing the serving performance of Retrieval-Augmented Generation (RAG) systems. The work makes three primary claims of contribution: (1) RAGSchema, a structured taxonomy for describing diverse RAG workloads; (2) a detailed performance characterization of four RAG paradigms using this schema, which reveals significant performance variability and shifting bottlenecks; and (3) RAGO, a system optimization framework that performs a systematic search over scheduling policies—specifically task placement, resource allocation, and batching—to find Pareto-optimal configurations for a given RAG workload.
Strengths
The primary novel contribution of this work lies in the synthesis of well-known systems optimization principles and their application to the specific, emerging domain of RAG serving. While the individual techniques employed by the RAGO framework are not new, their holistic integration to navigate the complex, heterogeneous (CPU for retrieval, XPU for inference), and multi-stage pipeline of RAG systems is a novel endeavor.
Specifically, the paper's novelty rests on:
-
A Unified Optimization Framework for a Novel System Class: Prior work has optimized LLM serving (prefix/decode splitting) or retrieval systems (ANN algorithms) in isolation. Other RAG-specific works, such as Chameleon [50] or PipeRAG [51], have proposed point solutions for specific bottlenecks (retrieval acceleration, iterative retrieval stalls). This paper is the first I am aware of to propose a generalizable and systematic framework that co-optimizes the entire end-to-end RAG pipeline, considering the interplay between all its optional and mandatory stages (encoder, rewriter, retrieval, reranker, prefix, decode).
-
Codification of the RAG Search Space: The RAGSchema abstraction, while fundamentally a taxonomy, serves as a necessary and novel contribution in the context of this work. It formalizes the configuration space of RAG pipelines, which is a prerequisite for any systematic optimization. By defining this structure, it enables the RAGO framework to operate methodically, a step beyond ad-hoc optimizations.
Weaknesses
While the application of the framework is novel, its constituent components and underlying methodology lack fundamental novelty. The work is more a feat of rigorous systems engineering and integration than one of conceptual invention.
-
Lack of Algorithmic Novelty in the Optimization Framework: The core of RAGO is a systematic, exhaustive search over a discretized space of scheduling policies, guided by an analytical cost model (Algorithm 1, page 11). This methodology is well-established and conceptually identical to design-space exploration frameworks in other domains. For instance, Timeloop [88] and MAESTRO use this exact approach (analytical modeling + systematic search) to find optimal dataflows for DNN accelerators. The authors have effectively built a "Timeloop for RAG serving scheduling," which is a valuable engineering contribution but not a new optimization paradigm.
-
The "Novel" Decisions are Applications of Prior Art: The key scheduling decisions RAGO explores are direct extensions of known techniques:
- Task Placement (Collocation vs. Disaggregation): The central placement decision explored by RAGO is whether to group (collocate) or separate (disaggregate) pipeline stages. This directly mirrors the "phase splitting" of prefix and decode stages, a concept already thoroughly explored in prior LLM serving literature such as Splitwise [89] and DistServe [132]. RAGO merely applies this known principle to a pipeline with more stages.
- Batching Policies: The use of techniques like continuous batching for the decode stage is considered standard practice in modern LLM serving systems like Orca [120] and vLLM [62]. RAGO incorporates this as a known best practice rather than introducing a new batching methodology.
-
Characterization as the Primary Insight: Much of the paper's intellectual weight rests on the characterization study (Section 5, pages 7-9), which demonstrates how bottlenecks shift depending on the RAGSchema. While insightful, a characterization study's novelty is contingent on it revealing profoundly counter-intuitive truths. The findings here—that retrieval can be a bottleneck at hyperscale (Case I), or that a small encoder can be a bottleneck on a huge input (Case II)—are logical consequences of Amdahl's Law applied to a new pipeline structure. They are valuable confirmations but not paradigm-shifting discoveries.
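To spell out the Amdahl-style reasoning invoked in the preceding point (my formulation, not an equation from the paper): with f the fraction of end-to-end time spent in one pipeline stage and s the acceleration applied to everything else, the attainable speedup is capped at 1/f.

```latex
% Standard Amdahl bound, included for illustration; not an equation from the paper.
\[
  \text{Speedup}(s) \;=\; \frac{1}{\,f + \dfrac{1-f}{s}\,},
  \qquad
  \lim_{s \to \infty} \text{Speedup}(s) \;=\; \frac{1}{f}.
\]
```

Hence a stage that is only a modest fraction of today's pipeline (retrieval at hyperscale, or a small encoder on a very long input) necessarily becomes the bottleneck once the remaining stages are scaled out, which is why I view these findings as confirmations rather than surprises.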
In summary, the paper's novelty is confined to its specific application domain. It does not introduce a new type of algorithm, a new systems primitive, or a fundamentally new theoretical insight into system performance. The contribution is the creation of the first such systematic optimizer for RAG, not the invention of a new kind of optimizer.
Questions to Address In Rebuttal
-
The search methodology in RAGO appears to be an exhaustive search over a pre-defined and discretized policy space. How is this approach fundamentally different from prior design-space exploration frameworks like Timeloop [88], beyond the target domain? Could the authors elaborate on any novel search or pruning strategies that were required to make this exploration tractable for the RAG domain?
-
The concept of disaggregating compute stages with different workload characteristics (compute-bound vs. memory-bound) is central to recent LLM serving systems [89, 132]. Can the authors clarify what new principles of task placement RAGO introduces beyond applying this known heuristic to a wider array of pipeline stages (e.g., rewriter, reranker)?
-
RAGSchema is presented as a key contribution. While it is a clear and useful abstraction, taxonomies themselves are not always considered novel research contributions. Could the authors argue why this abstraction is more than a descriptive framework and constitutes a novel scientific contribution in its own right, perhaps by demonstrating how it enables insights or optimizations that would be impossible otherwise?
Transitive Array: An Efficient GEMM Accelerator with Result Reuse
Abstract
Deep Neural Networks (DNNs) and Large Language Models (LLMs) have revolutionized artificial intelligence, yet their deployment faces significant memory and computational challenges, especially in resource-constrained environments. Quantization techniques ...
Reviews
Review 1
Paper Title: Transitive Array: An Efficient GEMM Accelerator with Result Reuse
Reviewer Persona: The Guardian (Adversarial Skeptic)
Summary
The authors introduce a novel sparsity paradigm, "transitive sparsity," which aims to reduce computations in GEMM by reusing partial results. The core idea is that if the binary representation of one row (a "TransRow" after bit-slicing) is a subset of another, the result for the latter can be computed by adding the difference to the former's result. This dependency is modeled using a Hasse graph. To manage these dependencies and enable parallel execution, the paper proposes a hardware mechanism called the "Scoreboard," which determines an execution order. The authors present the "Transitive Array," a multiplication-free accelerator architecture designed around this principle. Evaluations on LLaMA and ResNet models claim significant speedup and energy reduction over state-of-the-art accelerators like Olive and BitVert.
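For concreteness, the reuse rule being summarized can be illustrated with a toy bitmask example (my own sketch, not the authors' implementation): if the set bits of one TransRow are a subset of another's, the superset's partial sum is the subset's partial sum plus the activations selected by the XOR of the two masks.

```python
# Toy illustration of the reuse rule (not the paper's implementation). A TransRow
# is modeled as a bitmask selecting which activation elements are accumulated.

def partial_sum(bitmask: int, activations: list[float]) -> float:
    """Accumulate the activations at positions where the bitmask has a set bit."""
    return sum(a for i, a in enumerate(activations) if (bitmask >> i) & 1)

activations = [1.5, -2.0, 0.25, 3.0]
prefix = 0b0011                       # subset pattern
suffix = 0b1011                       # superset pattern

assert prefix & suffix == prefix      # the subset relation that enables reuse

prefix_sum = partial_sum(prefix, activations)
diff_bits = suffix ^ prefix           # only these positions need new additions
reused_sum = prefix_sum + partial_sum(diff_bits, activations)

assert abs(reused_sum - partial_sum(suffix, activations)) < 1e-9
```

In this 4-bit example the superset row costs one additional addition instead of three; the paper's claimed savings rest on such subset relationships occurring frequently after bit-slicing.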
Strengths
- The fundamental concept of identifying subset relationships in bit-sliced matrices to reuse partial sums is a logically sound approach to reducing redundant additions. It represents a form of fine-grained common subexpression elimination.
- The formalization of this dependency using a Hasse diagram (Section 2.3) provides a solid theoretical foundation for the proposed sparsity pattern.
- The architectural design is explicitly "multiplication-free," relying on shifters and adders post-bit-slicing. If the overhead is properly managed, this can lead to an efficient PE design.
Weaknesses
My primary concerns with this manuscript lie in the experimental methodology, the unquantified overhead of its core mechanism, and several unsubstantiated claims that undermine the impressive results.
-
Severely Limited and Potentially Unrepresentative Evaluation Scope: The evaluation on LLaMA models is critically flawed. The authors state in Section 5.1 that "we only extract the first Transformer block with a prefill sequence length of 2048." This is wholly insufficient for the following reasons:
- The computational characteristics and data distributions (weights and activations) of the first block are not necessarily representative of the entire model. Different layers can have vastly different properties.
- Crucially, the evaluation focuses exclusively on the prefill phase. LLM inference performance is often dominated by the memory-bound, latency-critical token-by-token decoding phase. By omitting an analysis of the decode phase, the paper ignores a major part of the problem and presents a one-sided performance picture that favors their compute-bound optimizations.
-
Unquantified and Potentially Prohibitive Overhead of the Dynamic Scoreboard: The paper’s claims of supporting dynamic activation patterns in attention layers hinge entirely on the "Dynamic Scoreboard." However, the actual runtime cost of this unit is never properly analyzed. The Scoreboarding algorithm (Algorithms 1 & 2) involves forward and backward passes over all potential nodes (up to 256 for 8-bit TranSparsity) to build a dependency forest.
- The paper fails to provide any cycle-level analysis of the Scoreboard's latency. How many cycles does it take to process a sub-tile and generate the Scoreboard Information (SI) before the PEs can begin execution? This latency is directly on the critical path for every tile processed dynamically.
- While an area figure is provided in Table 2, the energy consumption of this complex, on-the-fly graph traversal and optimization is not broken out. It is simply absorbed into the "Core" energy in Figure 11. This lack of transparency makes it impossible to assess if the savings in the PEs are nullified by the overhead of the Scoreboard.
-
Flawed and Unfair Baseline Comparisons: The performance comparisons, particularly for attention layers, are not rigorous.
- In Section 5.7, the authors compare their 8-bit quantized Transitive Array for attention against a 16-bit BitFusion baseline. This is not an iso-precision comparison. It is expected that a lower-precision implementation would be faster. The comparison is meaningless without a proper 8-bit baseline.
- The authors dismiss other state-of-the-art accelerators (Olive, BitVert) from the attention analysis by claiming they "do not support Attention layers" (Section 5.7). This is a strong claim that requires more justification. While these works focus on pre-processed weights, it is not proven that they are fundamentally incompatible. This exclusion conveniently removes the most relevant competitors from a key part of the evaluation.
-
Over-reliance on Favorable, Potentially Non-Generalizable Data: The entire hardware design and scheduling strategy is predicated on a key statistic mentioned in Section 4.6: "only approximately 1.67% of TransRows in our design have distances greater than 1."
- Where does this number come from? Is it an average across all layers and models, or a single data point from the LLaMA-1-7B first layer? The performance of the pipeline, especially potential stalls in the PPE array, is extremely sensitive to this value. The paper provides no evidence that this property holds across diverse models and workloads.
- The design space exploration in Section 5.2 is performed using a "random 0-1 matrix." The authors themselves show in Figure 13 that real data and random data have different characteristics. Basing fundamental design decisions like tiling size and bit-width on random data, rather than a thorough characterization of real-world model data, is methodologically weak.
-
Unsupported Theoretical Claims: In Section 2.4, the paper asserts that the Hasse graph can be divided into "T independent trees for T-bit transitive sparsity." This is presented as fact but lacks a formal proof or a detailed argument. This property is fundamental to their parallelization strategy, and its validity cannot be taken on faith. Is this an inherent property of the graph, or a consequence of their heuristic pruning (e.g., "assign only one prefix to every node"), which might lead to suboptimal reuse?
Questions to Address In Rebuttal
- Please provide end-to-end inference performance for LLaMA models, including both the prefill and, critically, the token decode phases. How does Transitive Array perform on metrics like time-to-first-token and token latency?
- Provide a detailed breakdown of the dynamic Scoreboard's latency in cycles. For a given sub-tile size (e.g., 256xT), how many cycles of overhead does the Scoreboard introduce before the processor can begin its work? How does this overhead scale with tile size and sparsity?
- Please justify the comparison in Figure 12 by either providing results against an 8-bit attention baseline or explaining why such a baseline is impossible to construct for BitFusion. Furthermore, provide a more rigorous argument for why accelerators like Olive and BitVert cannot be adapted for attention layers.
- Please provide data showing the distribution of dependency distances (as in Figure 9d) across all layers of the evaluated LLaMA models, as well as for other models like ResNet. How robust is the "1.67% of distances > 1" assumption?
- Please provide a more formal argument or proof for the claim in Section 2.4 that the dependency graph can be decomposed into T independent trees. Explain whether this is a natural property or a result of the Scoreboard's heuristic choices, and what the performance trade-offs of those choices are.
Review 2
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces "transitive sparsity," a novel and intriguing paradigm for accelerating General Matrix Multiplication (GEMM) in the context of bit-sliced, quantized deep neural networks. The core idea is to identify and exploit subset relationships between binary row vectors in a bit-sliced weight matrix. Instead of recomputing partial sums from scratch for each row, the proposed "Transitive Array" accelerator reuses the result from a "prefix" row (a subset) and performs a small number of additional accumulations to compute the result for a "suffix" row (a superset).
To manage the complex dependencies this creates, the authors propose a principled "Scoreboard" mechanism that represents the relationships as a Hasse diagram, allowing for the efficient, on-the-fly generation of an optimal execution order. The paper presents a complete hardware architecture that is multiplication-free and designed to handle both static weights and dynamic activations, a key challenge for attention mechanisms in LLMs. The evaluations demonstrate significant speedup and energy reduction over contemporary bit-serial and quantization-aware accelerators like BitVert and Olive.
Strengths
The primary strength of this work lies in its conceptual novelty and the depth of the proposed solution.
-
A Fundamentally New Perspective on Sparsity: The community has extensively explored value-based sparsity (zero weights/activations) and, more recently, bit-level sparsity (zero-skipping in bit-serial computation). This paper moves beyond simply skipping work and introduces a powerful concept of reusing work. "Transitive sparsity" is a genuinely new way to frame computational redundancy in quantized GEMM. It connects the problem to the rich mathematical field of order theory and opens a new avenue for optimization.
-
Principled and Elegant Formalization: The decision to model the reuse dependencies using a Hasse diagram (Section 2.3, page 3) is particularly insightful. This is not a heuristic-driven approach but a formal one that allows the authors to reason systematically about execution order, parallelism, and load balancing. This principled foundation gives the work a sense of robustness and intellectual depth that is often missing in purely empirical accelerator designs.
-
An End-to-End Architectural Solution: The authors do not stop at the conceptual level. They present a comprehensive architectural design, the Transitive Array, that tackles the practical challenges of their idea. The design of the Scoreboard unit, with its forward and backward passes to build a "balanced forest" of dependencies (Figure 5, page 5), is a sophisticated piece of microarchitecture. Critically, the inclusion of a dynamic Scoreboard to handle the online nature of activation tensors in attention layers demonstrates a keen awareness of the primary bottleneck in modern LLMs. This elevates the work from a clever trick for FC layers to a potentially viable solution for entire Transformer models.
-
Significant Potential Impact: If transitive sparsity is as prevalent as the authors' analysis suggests (e.g., theoretical 87.5% sparsity for LLaMA-7B mentioned on page 2), this work could have a substantial impact on the design of future low-precision accelerators. It suggests that merely building efficient pop-count and accumulate units is not enough; co-designing hardware to understand and exploit deeper structural patterns within the bit representations could yield another major leap in efficiency. The multiplication-free nature of the design further positions it as an extremely low-power solution.
Weaknesses
The weaknesses of the paper are primarily related to the potential overhead of its complexity and questions about the generality of its core assumption.
-
Hardware Overhead and Scalability of the Scoreboard: The Scoreboard is the brain of the operation, but its complexity is non-trivial. It involves sorting (Hamming-order), graph traversal (forward/backward passes), and bitmap manipulation. While the paper asserts this is efficient, a more detailed analysis of its area, power, and latency overhead is needed. It is unclear how this overhead scales with larger bit-widths (e.g., T=16) or larger tile sizes, and whether the Scoreboard's latency could become the new bottleneck in the pipeline, especially in the dynamic, on-the-fly scenario.
-
Generality of Transitive Sparsity: The entire premise hinges on the frequent occurrence of these subset/superset bit patterns. The evaluation is strong for LLaMA models, but the phenomenon needs to be characterized more broadly. How prevalent is transitive sparsity in other model families like Vision Transformers, CNNs, or diffusion models? How is it affected by different quantization algorithms (e.g., rounding modes, affine vs. symmetric quantization) or data distributions? While the work is compelling for LLMs, its claim as a general GEMM accelerator would be strengthened by a broader empirical study of this phenomenon.
-
Complexity of Data Movement: The result reuse mechanism implies a complex data movement pattern. A processing element working on a "suffix" row needs to fetch a partial sum computed by a potentially distant "prefix" row. The paper mentions a distributed prefix buffer and a Benes network (Section 4.4, page 7), but the potential for bank conflicts, network contention, and the management of this buffer (e.g., replacement policies for "prefix misses") feel underexplored. This data-forwarding network could introduce significant overhead and complexity that might erode some of the computational gains.
Questions to Address In Rebuttal
-
Could the authors provide a more detailed analysis of the area, power, and latency overhead of the dynamic Scoreboard unit? Specifically, how does its critical path compare to the data path computation, and how does it scale as the tile size (N) and bit-width (T) increase?
-
The performance of the Transitive Array is fundamentally tied to the prevalence of transitive sparsity. While the results on LLaMA models are promising, have the authors analyzed its prevalence across a wider range of model architectures (e.g., ViTs, different CNN backbones) and quantization schemes? How sensitive is the technique to the underlying weight and activation distributions?
-
The prefix buffer is central to enabling result reuse. Could the authors elaborate on the data access patterns to this buffer and the potential for bank conflicts or network contention, especially in the distributed design? How does the system handle a 'prefix miss' if the required partial sum has been evicted or belongs to a different tile that is not currently being processed?
Review 3
Review Form: The Innovator (Novelty Specialist)
Summary
This paper introduces a novel sparsity paradigm for GEMM acceleration, termed "transitive sparsity." The core idea is to exploit redundancies within a bit-sliced representation of a quantized matrix. Instead of merely skipping zero-valued bits, the proposed method identifies when the set of active bit positions for one row (a "TransRow") is a superset of the active bit positions for another. In such cases, the partial sum computed for the subset row can be reused as a starting point for the superset row's computation, requiring only the accumulation of the difference.
To manage the complex dependencies this creates, the authors formalize the relationships between TransRows using a Hasse diagram. They then propose a hardware accelerator, the "Transitive Array," which features a "Scoreboard" unit to dynamically determine the optimal, dependency-aware execution order. The architecture is multiplication-free, relying on XOR operations to identify difference bits and adders for accumulation. The authors claim significant speedup and energy reduction over state-of-the-art bit-slice and quantization-aware accelerators.
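Stated minimally (my notation, not the paper's): writing B(r) for the set of active bit positions of TransRow r and a_i for the activation selected by position i, the Hasse diagram encodes the subset partial order and the reuse identity it licenses.

```latex
% My notation, not the paper's: the subset partial order over TransRows and the
% partial-sum reuse identity it licenses.
\[
  r \preceq s \;\iff\; B(r) \subseteq B(s),
  \qquad
  \operatorname{psum}(s) \;=\; \operatorname{psum}(r) \;+\; \sum_{i \in B(s) \setminus B(r)} a_i .
\]
```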
Strengths
The primary strength of this paper is the genuine novelty of its core concept. My analysis of the prior art confirms that "transitive sparsity" represents a new and fundamentally different way to conceive of sparsity in the context of bit-sliced computation.
-
Novel Sparsity Paradigm: The central contribution, "transitive sparsity," appears to be a genuinely novel paradigm. Existing bit-slice accelerators (e.g., Bit-Pragmatic, BitVert) primarily focus on zero-skipping, which is a form of unstructured sparsity exploitation. This paper shifts the focus to exploiting structural relationships between non-zero bit patterns. The idea of reusing computation from
(0011) to compute (1011) (Figure 1, page 2) by treating the former as a sub-problem of the latter is a clear departure from prior work.
-
Elegant Formalization: The use of a Hasse graph (Section 2.3, page 3) to represent the partial ordering of TransRows is a particularly elegant and novel formalization for this specific problem. It provides a solid theoretical foundation for the dependency analysis and elevates the contribution beyond a mere ad-hoc optimization.
-
Qualitative Shift from Prior Art: This work can be conceptually framed as applying common sub-expression elimination (CSE) to the bit-level operations of a GEMM. While CSE is a well-known concept, its specific application to the bit-patterns of rows within a single GEMM to enable partial sum reuse in a hardware accelerator is, to my knowledge, new. It exploits a type of redundancy that is orthogonal to and deeper than the zero-value sparsity targeted by most contemporary accelerators.
Weaknesses
While the core idea is novel, its realization in the Transitive Array introduces considerable complexity, and the paper does not fully delineate the boundaries of its applicability.
-
High Implementation Complexity for Novelty: The proposed Scoreboard mechanism (Section 3, page 4) is a highly complex and stateful piece of hardware. It must record all present TransRows, build a dependency graph via forward and backward passes, and balance workloads across lanes. This is an order-of-magnitude more complex than the simple zero-detection logic it seeks to improve upon. The novelty comes at the cost of significant control logic overhead, which may have implications for timing, area, and power not fully captured by the high-level comparison in Table 2 (page 9).
-
Scalability of the Core Idea: The Hasse graph representation scales exponentially with the bit-width T. For the 8-bit TransRows explored (2⁸ = 256 nodes), the Scoreboard is manageable. However, the novelty of this approach seems questionable for wider data types (e.g., 16-bit), where a graph with 2¹⁶ nodes would be computationally intractable to analyze at runtime. The paper acknowledges this implicitly by selecting T=8, but a more explicit discussion of this fundamental scaling limitation is warranted.
-
Narrowing the Scope of Novelty: The concept of result reuse via delta computation is not new in a broader sense (e.g., temporal reuse in video processing, differential data transmission). The paper’s novelty lies strictly in its intra-GEMM application based on static bit patterns. A brief discussion situating the work relative to these broader concepts of computational reuse would help to more precisely define the "delta" of the contribution.
Questions to Address In Rebuttal
-
Overhead of the Scoreboard: Could the authors provide a more detailed analysis of the latency and energy overhead of the dynamic Scoreboard unit itself? The Scoreboarding process involves sorting and multiple graph traversal passes (Figure 5, page 5). How many cycles does this process take for a given tile size (e.g., 256 rows), and how does this pre-computation latency impact overall pipeline efficiency, especially for smaller GEMM operations?
-
Scalability to Wider Bit-Widths: The paper’s design space exploration (Section 5.2, page 8) concludes that 8-bit is optimal. Does this imply that the novelty of transitive sparsity is fundamentally limited to low-precision (<= 8-bit) operands? How would the authors propose to apply this concept to 16-bit or 32-bit integer GEMM, where the Hasse graph becomes intractably large?
-
Differentiation from Broader Concepts: The concept of reusing partial products bears a resemblance to common sub-expression elimination (CSE) in compilers or certain multiplier architectures that aim to reduce the number of partial products. Can the authors further differentiate their contribution from these broader concepts, clarifying why its application to bit-sliced DNN weights is a non-obvious and significant inventive step?
-
Sensitivity to Data Distribution: How sensitive is the performance of "transitive sparsity" to the underlying data distribution? The comparison in Section 5.9 (page 11) between real and random data is a good start. However, do the benefits hold for non-DNN workloads or for quantization schemes (e.g., non-uniform) that might produce less structured or less-repeating bit patterns? This would help clarify if the novelty is in a general-purpose GEMM optimization or one highly co-designed for the statistical properties of quantized neural networks.
Light-weight Cache Replacement for Instruction Heavy Workloads
Abstract
The last-level cache (LLC) is the last chance for memory accesses from the processor to avoid the costly latency of accessing the main memory. In recent years, an increasing number of instruction heavy workloads have put pressure on the last-level cache. ...
Reviews
Review 1
Paper Title: Light-weight Cache Replacement for Instruction Heavy Workloads
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose PACIPV, a static, low-overhead LLC replacement policy based on an extension of Re-Reference Interval Prediction (RRIP). The core contribution is the use of separate Insertion and Promotion Vectors (IPVs) for demand and prefetch accesses, which are determined via an exhaustive search over a heuristically constrained design space. The authors claim that for instruction-heavy workloads, PACIPV achieves performance competitive with or slightly better than a complex, state-of-the-art policy (Mockingjay) and superior to the baseline SRRIP, while incurring significantly less hardware overhead. A central, and counter-intuitive, finding is that this performance gain is achieved despite an increase in the Last-Level Cache (LLC) Misses Per Kilo-Instruction (MPKI), which the authors attribute to a reduction in the average L1I demand miss latency.
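To make the claimed simplicity tangible, the mechanism can be sketched in a few lines (my illustration; the vector values are made up and are not the IPVs the paper derives): with 2-bit RRPVs, each IPV holds one promotion target per current RRPV plus one insertion value, and the access type selects which vector is consulted.

```python
# Minimal sketch of IPV-driven RRIP (illustrative only; these vector values are
# made up and are not the IPVs the paper derives). With 2-bit RRPVs, an IPV has
# one entry per current RRPV (the promotion target on a hit) plus one entry for
# the RRPV assigned on insertion.

RRPV_MAX = 3                           # 2-bit re-reference prediction value

DEMAND_IPV   = [0, 0, 1, 1, 2]         # hypothetical: promote strongly, insert at 2
PREFETCH_IPV = [1, 2, 2, 3, 3]         # hypothetical: keep prefetches near eviction

def new_rrpv(current_rrpv: int | None, is_prefetch: bool) -> int:
    """RRPV after a hit (current_rrpv given) or after a fill (current_rrpv is None)."""
    ipv = PREFETCH_IPV if is_prefetch else DEMAND_IPV
    return ipv[RRPV_MAX + 1] if current_rrpv is None else ipv[current_rrpv]

def select_victim(set_rrpvs: list[int]) -> int:
    """SRRIP-style victim search: age every way until one reaches RRPV_MAX."""
    while RRPV_MAX not in set_rrpvs:
        set_rrpvs[:] = [v + 1 for v in set_rrpvs]
    return set_rrpvs.index(RRPV_MAX)
```

The only state added over baseline SRRIP is the two small static vectors, which is the basis of the overhead claim examined below.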
Strengths
-
Low Hardware Overhead: The paper convincingly argues that PACIPV's hardware cost is minimal, representing a negligible addition to a standard SRRIP implementation. This is a clear and quantifiable advantage over complex, table-based predictors like Mockingjay or SHiP, making the proposal pragmatically appealing.
-
Systematic Design Space Exploration: The methodology of constraining the vast IPV design space with heuristics and then performing an exhaustive search is more rigorous than relying on stochastic methods like genetic algorithms. This approach lends confidence that the selected IPVs are indeed optimal within the defined constraints.
-
Logical Separation of Access Types: The fundamental idea of treating demand and prefetch accesses with distinct policies is sound. Prefetches are speculative and may have different reuse characteristics than demand requests; a policy that acknowledges this distinction is well-motivated.
Weaknesses
My primary concerns with this work center on the central performance claim, the generality of the findings, and the rigor of the evaluation methodology.
-
The Central Performance Claim is Unsubstantiated: The paper's most critical argument—that higher LLC MPKI leads to better performance—hinges on a tenuous causal link to reduced L1I average demand miss latency (Section 5.1, page 8). This explanation is insufficient and lacks rigorous proof.
- The authors hypothesize that the retained blocks are more critical, or that the additional misses are to non-critical blocks that do not gate execution due to memory-level parallelism (MLP). This is pure speculation. The paper provides no data on MLP, no analysis of the criticality of the misses incurred by PACIPV versus other policies, and no distribution of miss latencies. "Average" latency can be a misleading metric; a policy could reduce latency for many non-critical misses while increasing it for a few highly critical ones, resulting in a net performance loss not captured by the average.
- Without a quantitative analysis connecting the specific eviction decisions made by PACIPV to a tangible reduction in critical-path stalls, the core performance claim is an observation that conflates correlation with causation.
-
Questionable Generality and Overfitting: The chosen IPVs appear to be highly tuned to the specific characteristics of the CVP training workloads, undermining the claim of a broadly applicable static policy.
- The results on the GAPS workloads are telling (Section 5.2.3, page 10). When using the IPVs trained on CVP (PACIPV-I), the policy significantly underperforms Mockingjay. It only becomes competitive when retrained specifically on GAPS (PACIPV-D). This strongly suggests that the policy does not generalize well across workload domains (instruction-heavy vs. data-heavy) and is essentially overfitting to the training set's microarchitectural behavior.
- This contradicts the premise of finding a robust, static policy. Instead, it suggests that the methodology of searching for IPVs is effective, but the resulting static policy is brittle. The authors briefly mention set-dueling (page 6) but do not pursue it, which seems like a missed opportunity to address this clear limitation.
-
Opaque Workloads Weaken the Evaluation: The evaluation relies heavily on the CVP benchmark suite, which the authors admit are opaque commercial traces whose "actual identities... are not revealed publicly" (Section 4.2, page 7).
- This presents a major methodological flaw. We cannot independently verify their classification as "instruction heavy." The paper provides no characterization of these workloads (e.g., instruction footprint, L1I MPKI, branch misprediction rates, IPC).
- Without this crucial context, it is impossible to understand why PACIPV performs well or to assess whether the results are applicable to any broader class of applications. The findings are tied to a black-box workload set, limiting their credibility and impact.
-
Inconsistent and Overstated Claims Against SOTA: The paper's claims regarding its performance against Mockingjay are inconsistent.
- The abstract claims PACIPV "improves performance over a state-of-the-art LLC replacement policy (Mockingjay)".
- However, Figure 1 (page 1) shows a 0.1% lead, which is well within the noise margin of microarchitectural simulation. Figure 4 (page 8) is described as PACIPV "roughly matches the performance of Mockingjay".
- This is an overstatement of the contribution. A 0.1% difference does not constitute an "improvement"; it constitutes a statistical tie. The authors should present their results with more precision and less embellishment. The true benefit is achieving this parity with far lower cost, not by "outperforming" the SOTA.
Questions to Address In Rebuttal
The authors must address the following points to strengthen the paper for publication:
-
Prove the IPC vs. MPKI Link: Can you provide a detailed analysis to substantiate your claim that higher LLC MPKI leads to higher IPC? This would require, at a minimum:
- An analysis of memory-level parallelism (MLP) for the different policies.
- A distribution of miss latencies, not just the average.
- Data on whether the misses introduced by PACIPV are on the critical path of execution.
-
Justify Workload Selection: Please provide a detailed microarchitectural characterization of the CVP workloads used for training and testing. Justify their label as "instruction heavy" with metrics such as L1I miss rates, instruction cache footprint, and front-end vs. back-end stall breakdowns. Why should the community trust results derived from these opaque traces?
-
Reconcile the Generality Claim: How do you reconcile the claim of a robust, static policy with the GAPS results (Figure 10), which clearly show a significant performance degradation when the policy is not retrained for the target workload domain? If the IPVs are so workload-sensitive, doesn't this argue against a static implementation and in favor of a dynamic approach like set-dueling, which you mention but do not evaluate?
-
Clarify the SOTA Comparison: Please revise your claims of "improving performance" over Mockingjay. Acknowledge that the performance is, at best, statistically equivalent and reframe the contribution around achieving this parity with substantially lower hardware complexity.
Review 2
Paper Title: Light-weight Cache Replacement for Instruction Heavy Workloads
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces PACIPV, a lightweight Last-Level Cache (LLC) replacement policy specifically tailored for the increasingly prevalent instruction-heavy workloads found in modern systems. The core contribution is not the invention of a new mechanism from scratch, but rather an elegant and effective synthesis of two existing ideas: Re-Reference Interval Prediction (RRIP) and Insertion/Promotion Vectors (IPVs). By applying the IPV concept to the coarse-grained 2-bit state of RRIP (instead of the fine-grained state of LRU), the authors make the policy design space small enough to be searched exhaustively.
The key innovations are twofold: first, the separation of IPVs for demand and prefetch accesses, making the policy inherently prefetch-aware in a simple manner; and second, the demonstration that this low-complexity approach can achieve performance competitive with, or even slightly better than, a highly complex state-of-the-art policy like Mockingjay for the target workload class. The authors provide a compelling analysis, showing that while PACIPV may increase LLC Misses Per Kilo-Instruction (MPKI), it improves overall performance by reducing the average demand miss latency at the L1 instruction cache, indicating it is more effective at retaining performance-critical blocks.
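One back-of-the-envelope way to state that argument (my formulation, not the authors'): front-end stall cycles scale roughly with the product of L1I demand misses and their average latency, so a policy can afford a higher LLC MPKI if it lowers the average L1I demand miss latency.

```latex
% Rough first-order model, for illustration only; it deliberately ignores
% miss overlap and memory-level parallelism.
\[
  \text{Front-end stall cycles per kilo-instruction}
  \;\approx\;
  \text{MPKI}_{\mathrm{L1I}} \times \overline{L}_{\mathrm{L1I\ miss}} ,
\]
% where the average L1I demand miss latency depends on which instruction blocks
% the LLC retains; lowering it can raise IPC even as total LLC MPKI rises.
```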
Strengths
-
Excellent Problem Motivation and Timeliness: The work is grounded in a well-established trend documented in recent systems research (e.g., [2, 3, 11, 25]): the growing importance of the instruction-delivery front-end as a performance bottleneck in warehouse-scale computers. By focusing specifically on instruction-heavy workloads, the paper addresses a problem of significant and immediate relevance to both industry and academia.
-
Elegance and Simplicity of the Proposed Solution: The paper's greatest strength lies in its "less is more" philosophy. Instead of adding more tables, complex predictors, or machine learning structures, it re-purposes and combines existing lightweight mechanisms. The decision to base the policy on RRIP's 4-state recency representation is a crucial insight, as it transforms the intractable design space of LRU-based IPVs into a manageable one that can be optimized with a principled, exhaustive search. This stands in stark contrast to the high hardware overhead and verification complexity of policies like Mockingjay, as clearly highlighted in Table 2 (page 10). This simplicity makes the proposal highly practical and adoptable.
-
Insightful Performance Analysis: The authors present a sophisticated analysis that goes beyond top-line speedup numbers. The observation in Figures 5 and 6 (page 8) that PACIPV improves IPC despite a higher LLC MPKI is counter-intuitive and interesting. Their hypothesis—that the policy is better at reducing L1I demand miss latency by prioritizing the right blocks in the LLC—is well-reasoned and provides a deeper understanding of the mechanism's behavior. Furthermore, the discussion in Section 6.2.2 (page 12) explaining why PC-based predictors (used by SHiP and Mockingjay) are fundamentally ill-suited for instruction streams is a critical insight that beautifully contextualizes why a simpler, non-PC-based approach can win here.
Weaknesses
-
Static Policy in a Dynamic World: The primary limitation is the static nature of the IPVs, which are determined offline during a design-time training phase. While the authors demonstrate that the resulting vectors generalize well across a suite of similar workloads, a single static policy is unlikely to be optimal for the vast diversity of applications and phases running in a real-world datacenter. The "PACIPV-best" results in Figure 8 (page 10) clearly show the performance left on the table by not adapting per-workload.
-
Specialization vs. General-Purpose Performance: The work is explicitly and successfully targeted at instruction-heavy workloads. However, the results on the data-heavy GAPs suite (Figure 10, page 11) show that the instruction-trained vectors significantly underperform a policy trained on GAPs itself, and lag Mockingjay by a non-trivial margin. While retraining closes this gap almost entirely, it underscores that the policy, as presented, is a specialist. A general-purpose processor needs a policy that is robust across all workload types. The current proposal would require an adaptation mechanism to be universally effective.
-
Implicit Dependence on Prefetcher Characteristics: The policy's "prefetch-aware" nature is achieved by having a separate IPV for prefetch-initiated accesses. This is a clean design, but the optimal prefetch IPV is surely dependent on the behavior (e.g., accuracy, timeliness, aggressiveness) of the instruction prefetcher (EIP). The paper does not explore this sensitivity. For instance, would a less accurate prefetcher benefit from a more conservative prefetch IPV that inserts blocks at a higher (less-likely-to-be-kept) RRPV?
Questions to Address In Rebuttal
-
Regarding the static nature of the policy: Given the significant gap between the static PACIPV and the per-workload PACIPV-best (Figure 8, page 10), have the authors considered a lightweight dynamic adaptation mechanism? For example, instead of full set-dueling, could a simple runtime mechanism (perhaps hinted by the OS or a simple hardware counter) switch between a few pre-selected IPV pairs (e.g., one for instruction-heavy phases, one for data-heavy phases)? This could potentially bridge the gap to PACIPV-best without incurring the complexity of policies like Mockingjay.
-
The results on the GAPs suite (Figure 10, page 11) are very interesting. They demonstrate the value of training for the right workload class. The authors mention set-dueling as a potential solution. Could you elaborate on how you envision this working? Would dueling between just two candidate pairs—the best "instruction-heavy" pair and the best "data-heavy" pair found in your study—be sufficient to create a robust, general-purpose policy that performs well everywhere?
-
How sensitive are the optimal demand and prefetch IPVs to the specific instruction and data prefetchers used (EIP and Berti)? If a different, perhaps more aggressive, instruction prefetcher were used, how do you hypothesize the optimal prefetch IPV would change? A brief discussion on this co-design aspect would strengthen the paper's conclusions about prefetch-awareness.
Review 3
Paper Title: Light-weight Cache Replacement for Instruction Heavy Workloads
Reviewer Persona: The Innovator (Novelty Specialist)
Summary
This paper proposes PACIPV, a static LLC replacement policy that extends the concept of Insertion/Promotion Vectors (IPVs) to a Re-Reference Interval Prediction (RRIP) baseline. The central novel claim, as I interpret it, is the use of two distinct, static IPVs—one for handling demand accesses and another for prefetch accesses. The authors leverage the coarse-grained nature of RRIP's 2-bit recency states to create a design space small enough for exhaustive search, contrasting with prior IPV work that required genetic algorithms for larger LRU state spaces. The paper's thesis is that this extremely simple, low-cost mechanism can achieve performance competitive with, or even slightly better than, complex state-of-the-art policies like Mockingjay, particularly for instruction-heavy workloads.
Strengths
From a novelty perspective, the paper has several strengths:
-
The Dual-Vector Mechanism: The core idea of maintaining two completely separate Insertion and Promotion Vectors for demand and prefetch streams appears to be novel. While prior work has certainly made replacement policies "prefetch-aware" (e.g., KPC, SHiP++, Mockingjay), they typically do so by using prefetch confidence to alter a single policy's decision or by modifying a specific promotion rule. The PACIPV approach of defining two distinct, complete policy vectors is a clean, simple, and previously unexplored mechanism for specializing replacement behavior.
-
Application of IPVs to RRIP: The original IPV concept, as presented by Jiménez [22] and Teran et al. [55], was applied to tree-based pseudo-LRU. Applying this vector-based policy definition to RRIP is a logical but non-obvious extension that has not been explored in prior literature. This adaptation is what enables the methodological novelty discussed next.
-
Methodological Novelty in Policy Search: The authors make a valuable contribution by demonstrating that the design space for RRIP-based IPVs is small enough to be searched exhaustively. This is a significant departure from the original IPV work [22] which relied on genetic algorithms to navigate the astronomical search space of 16-way LRU. By showing that an optimal static policy can be found deterministically for a given training set, the authors provide a more robust design methodology than prior heuristic approaches.
-
Novelty in Simplicity: The most compelling aspect of this work is the novelty of its "less is more" result. It is a novel finding that a mechanism with virtually no hardware overhead beyond baseline SRRIP (a few dozen static bits for the vectors) can match the performance of Mockingjay, a policy requiring kilobytes of state, sampled caches, and complex prediction logic. The contribution here is not just the mechanism itself, but the insight that extreme complexity may be unnecessary for this problem domain.
Weaknesses
My concerns are focused exclusively on the scope and positioning of the novel contribution:
-
Incremental Nature of the Contribution: The paper's novelty rests on the specific combination of three existing concepts: RRIP, IPVs, and differentiating policy for prefetches. While the specific combination is new, the conceptual leap is modest. The paper would be stronger if it more rigorously differentiated its dual-vector approach from the mechanisms in SHiP++ [58] (mentioned in Section 2.3) and KPC [33]. These policies also treat prefetches differently; the authors should more clearly articulate why their dual-vector mechanism represents a fundamental advancement over these other forms of differentiation, rather than just an alternative implementation.
-
Overstatement of Performance Novelty: The performance improvement over the most complex state-of-the-art, Mockingjay, is marginal at best (3.3% vs 3.2% speedup over LRU, as shown in Figure 1, page 1). The true novelty lies in achieving this performance parity with drastically lower complexity. However, the paper's narrative sometimes frames this as a performance win. The contribution's framing should be precise: the novelty is a new point on the Pareto curve of complexity vs. performance, not a new absolute performance record.
-
Static Nature Limits General Novelty: The proposed policy is fundamentally static; the vectors are hard-coded at design time based on a training set. While this is the source of its simplicity, it raises questions about the generality of the contribution. The methodology for finding the vectors is general, but the specific vectors presented are not. The work does not sufficiently explore how brittle these "optimal" vectors are to changes in the underlying microarchitecture—for instance, a different front-end prefetcher or a different core design. The novelty claim would be stronger if it included an analysis of this sensitivity.
Questions to Address In Rebuttal
To strengthen the paper's claims of novelty, I would like the authors to address the following:
-
The core novel mechanism is the use of distinct IPVs for demand and prefetch accesses. Can you please elaborate on the fundamental differences between this dual-vector approach and the way prior prefetch-aware policies, such as SHiP++ and KPC, differentiate their handling of prefetched blocks? Is the dual-vector approach strictly more expressive or does it simply offer a different trade-off?
-
Your methodology for finding optimal static IPVs is a key part of your contribution. However, these optimal vectors are trained with a specific configuration (e.g., Berti and EIP prefetchers). How sensitive are the derived vectors to these components? If a different, equally effective instruction prefetcher were used instead of EIP, how much performance would be lost by retaining the current vectors, and would a full retraining be necessary to restore performance parity?
-
Given that the performance gain over a complex policy like Mockingjay is ~0.1%, would you agree that the primary novel contribution is the demonstration that a vastly simpler policy can achieve performance parity, rather than a contribution to absolute performance? Clarifying this in the paper's narrative would help to properly position its novelty.
The Sparsity-Aware LazyGPU Architecture
Abstract
General-Purpose Graphics Processing Units (GPUs) are essential accelerators in data-parallel applications, including machine learning and physical simulations. Although GPUs utilize fast wavefront context switching to hide memory access latency, memory ...
Reviews
Review 1
Reviewer Persona: The Guardian (Adversarial Skeptic)
Summary
This paper proposes "LazyGPU," a GPU architecture designed to improve performance by reducing memory traffic. The core ideas are threefold: 1) a LazyCore that defers memory load requests until the data is actually needed, aiming to reduce memory system congestion; 2) a Zero Cache that stores masks to identify and eliminate memory transactions for data that is entirely zero (LazyCore+①); and 3) an instruction-level optimization that suspends loads for operands of ⊗ instructions (e.g., multiply) when the other operand is zero (LazyGPU). The authors evaluate their proposal using MGPUSim on a set of benchmarks, including ResNet-18 and LLaMA 7B, claiming significant speedups, particularly for sparse workloads.
While the proposed mechanisms are individually interesting, the work suffers from a reliance on an outdated architectural baseline, a critical logical inconsistency in its core mechanism, and an insufficient evaluation that fails to substantiate its primary claims against a rigorously defined state-of-the-art. The hardware overhead analysis is also overly simplistic and likely underestimates the true cost.
Strengths
- The high-level concept of combining lazy execution with sparsity optimizations to tackle the GPU memory wall is a valid research direction.
- The
⊗ instruction optimization (referred to as optimization ② in Section 4.3) is a novel, fine-grained technique for eliminating dead memory accesses by linking instruction semantics directly to memory system behavior.
- The evaluation includes modern and highly relevant machine learning workloads (ResNet-18, LLaMA 7B), which is commendable.
Weaknesses
-
Outdated Architectural Baseline: The entire evaluation is built upon MGPUSim simulating an AMD GCN3 (R9 Nano) architecture, which is nearly a decade old. Modern high-performance GPUs (e.g., NVIDIA Hopper, AMD CDNA) feature vastly different memory subsystems, including technologies like HBM3, large last-level caches (e.g., Infinity Cache), and specialized hardware for asynchronous data movement (e.g., Tensor Memory Accelerator). The memory congestion problem, which is the central motivation for
LazyCore, manifests very differently on these systems. The paper's core premise—that simply deferring loads is a net win—is not proven to be relevant for current hardware, rendering the performance results highly questionable in terms of generalizability. -
Contradictory Claims Regarding Sub-Block Sparsity: The paper's logic for handling sparse data is critically flawed. In Section 3 (page 5, "Challenge 1"), the authors correctly state a major problem: "...it is not feasible to eliminate such memory transactions where the required portion of the data is zero as memory systems lack this information..." This acknowledges that the memory system operates at a fixed transaction granularity (e.g., 32B) and cannot natively handle requests for partial, all-zero data within that block. However, Section 4.2 and Figure 14 then claim that
LazyCore+① is the solution, eliminating far more requests than can be accounted for by full 32B-block sparsity (which Figure 4 shows is very low, e.g., 2.7% for ResNet-18 inference). The paper never explains how the LazyCore overcomes this fundamental "not feasible" barrier. How does the core communicate the exact byte-level requirements of a strided load to the Zero Cache and memory system to enable this optimization? Without a precise hardware mechanism, the claimed benefits from optimization ① are unsubstantiated.
Insufficient and Poorly Defined Competitive Baseline: The authors position their work as an alternative to "eager execution." However, their comparison is superficial. At the end of the first paragraph of Section 5.2 (page 10), they provide two speedup numbers (1.26x and 1.02x) for an "eager execution with zero caches" baseline. There is absolutely no detail provided on this baseline. Was it simply the baseline MGPUSim with a zero cache added? Did it include a modern, aggressive hardware prefetcher, which is the hallmark of eager execution systems? A rigorous study would implement and evaluate against a strong, well-defined eager execution baseline. As it stands, the paper compares its complex lazy design against a strawman.
-
Oversimplified Hardware Overhead Analysis: The analysis in Section 5.5 is incomplete to the point of being misleading. It calculates storage costs for "Busy Bits" and "Address Upper Bits" and arrives at a negligible 0.009% area overhead. This analysis completely ignores the significant costs of the associated control logic:
- The
Lazy Unit itself, which must track dependencies for all pending loads.
- The modifications to the instruction decoder and issue stage to identify
⊗instructions and suspend/reactivate their associated loads. The actual area, power, and latency overhead of these complex logic structures is unstated and is almost certainly much higher than reported.
- The
-
Unexamined Performance Penalty in Low-Contention Scenarios: The lazy execution model inherently adds latency to every memory access that is not eliminated. The authors claim that Thread-Level Parallelism (TLP) hides this, but their own data in Figure 3a shows that
LazyCore only provides a benefit when the number of wavefronts is very high (>2048). For workloads with fewer active wavefronts, performance is either the same or worse than the baseline. This demonstrates a critical trade-off that the paper fails to adequately address: the proposed architecture may actively harm performance on any workload that does not fully saturate the memory system.
Questions to Address In Rebuttal
-
Please provide a compelling justification for using the GCN3 architecture. How do you expect the proposed mechanisms to interact with modern memory features like NVIDIA's TMA or AMD's Infinity Cache? What evidence can you provide that the observed memory congestion patterns and the resulting benefits of your design are not artifacts of this outdated baseline?
-
Please provide a precise, register-transfer-level (RTL-like) description of the mechanism that reconciles the "not feasible" challenge from Section 3 with the capabilities claimed for
LazyCore+①. How does the core inform the memory hierarchy of the specific bytes a wavefront requires from a 32-byte memory transaction, such that a request can be elided if only that specific subset is zero? -
Please provide a detailed specification of your "eager execution with zero caches" baseline used for comparison in Section 5.2. Detail the prefetching policy, MSHR configuration, and any other relevant parameters. A more thorough, head-to-head comparison is required.
-
Please provide a more comprehensive hardware overhead analysis that includes area and power estimates for the control logic of the Lazy Unit, the Zero Cache, and the modified instruction front-end, not just the storage bits.
-
Please analyze and discuss the performance of
LazyGPU in scenarios with low memory contention or for latency-sensitive kernels. At what point does the inherent latency penalty of the lazy approach begin to dominate the benefits of congestion reduction?
Review 2
Paper Title: The Sparsity-Aware LazyGPU Architecture
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents LazyGPU, a novel GPU architecture designed to mitigate memory bandwidth contention by fundamentally reconsidering when memory requests should be issued. Instead of the conventional eager approach (e.g., prefetching), which issues requests as early as possible, LazyGPU deliberately delays them. This "lazy execution" model creates a critical look-ahead window, which the authors leverage to enable two powerful, sparsity-aware optimizations. First, by integrating a Zero Cache, the architecture can check if the data required by a wavefront is entirely zero and, if so, completely eliminate the memory request. Second, by analyzing upcoming instructions, it can identify and eliminate "dead" memory requests whose fetched values would have no impact on the program's outcome (e.g., a value that will be multiplied by zero). The authors evaluate their proposal on a range of workloads, demonstrating significant speedups, particularly on sparse neural network models like ResNet-18 and LLaMA 7B.
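For readers skimming the mechanism, the decision flow at the moment a deferred load's value is finally demanded can be sketched as follows (my illustration; the ZeroCache interface and every name here are invented, not the paper's hardware design):

```python
# Illustrative sketch only: the two request-elision checks described in the
# summary, applied when a deferred load's value is finally needed. The ZeroCache
# class, memory interface, and all names are invented for illustration.

class ZeroCache:
    """Tracks block addresses currently known to contain only zeros."""
    def __init__(self) -> None:
        self.zero_blocks: set[int] = set()

    def is_all_zero(self, block_addr: int) -> bool:
        return block_addr in self.zero_blocks

def resolve_deferred_load(block_addr, consumer_op, other_operand, zero_cache, memory):
    # Check (1): the whole block is known-zero, so no memory transaction is needed.
    if zero_cache.is_all_zero(block_addr):
        return 0.0
    # Check (2): the consumer is a multiply-like op and its other source operand
    # is zero, so the loaded value cannot change the outcome (a "dead" load).
    if consumer_op in ("mul", "mad") and other_operand == 0:
        return 0.0
    # Only now is the request actually issued to the memory system.
    return memory.load(block_addr)
```

The essential point is ordering: both checks happen before any request reaches the memory system, which is what distinguishes this approach from zero-cache designs that issue the request concurrently.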
Strengths
The core strength of this paper lies in its elegant synthesis of three distinct architectural concepts—lazy execution, zero-value caching, and dead instruction elimination—into a cohesive and impactful solution for a critical problem.
-
Novel Synergistic Mechanism: The central insight is not just applying lazy execution to GPUs, but recognizing that laziness is an enabling mechanism. The delay inherent in the lazy model provides the necessary time and information to make intelligent decisions about memory traffic. While prior work has explored Zero Caches (as cited in Section 2, page 3), those proposals often still issue memory requests concurrently with the zero-mask check. LazyGPU’s approach of checking before issuing the request to the memory system is a significant conceptual advance that directly attacks bandwidth consumption, not just latency.
-
Addressing a Timely and Critical Problem: The paper is exceptionally well-positioned at the confluence of two major trends in high-performance computing: the "memory wall" and the increasing prevalence of sparsity. As models like LLMs grow, unstructured sparsity from techniques like pruning is becoming a key tool for managing computational cost. Current hardware, like NVIDIA's sparse tensor cores, often requires structured sparsity (e.g., 2:4 patterns). LazyGPU's mechanism is inherently suited to handle unstructured, fine-grained sparsity, making it highly relevant to the future of efficient machine learning acceleration.
-
Compelling Performance Results: The empirical results strongly support the architectural claims. The 2.18x speedup on LLaMA 7B inference at 60% sparsity (mentioned in the Abstract, page 1) is particularly compelling and immediately grounds the work in a high-impact domain. The methodical breakdown of performance gains from the baseline to
LazyCore, LazyCore+①, and the full LazyGPU (Figure 9, page 10) provides a clear and convincing narrative of where the benefits originate.
Weaknesses
While the core idea is strong, the paper could benefit from a broader discussion of its place within the larger architectural landscape and the potential second-order effects of its design.
-
Tension with Existing Architectural Philosophies: The proposal fundamentally pushes back against the decades-long trend of "eager" and speculative execution. A key missing piece of the discussion is how LazyGPU would interact with other standard components of a modern GPU memory system, particularly hardware prefetchers. A lazy execution core and an aggressive, eager prefetcher are philosophically opposed. Does LazyGPU obviate the need for prefetching, or would the two mechanisms need a complex protocol to coexist without working at cross-purposes? A deeper exploration of this tension would better situate the work.
-
Implicit Assumptions about Workload Parallelism: The paper argues that GPUs' massive thread-level parallelism (TLP) is well-suited to hide the additional latency introduced by the lazy model. This is a plausible and intuitive argument. However, it remains an implicit assumption. The analysis would be stronger if it explored the limits of this assumption. For instance, how does performance scale as TLP decreases? Kernels with high register pressure or significant thread divergence might not have enough active wavefronts to hide the latency, potentially turning the lazy approach into a net negative.
-
Scope of Instruction-Based Elimination: The optimization to eliminate loads based on subsequent instructions (Section 4.3, page 8) is a powerful idea. The paper focuses primarily on multiply and multiply-add instructions, which are certainly dominant in the evaluated ML workloads. However, this concept could be generalized. For example, a load whose value is destined only for a logical AND with a register known to be zero could also be eliminated. A broader discussion on the potential classes of instructions amenable to this optimization would strengthen the generality of the contribution.
Questions to Address In Rebuttal
-
Could the authors elaborate on the level of thread-level parallelism (TLP) or wavefront occupancy required to effectively hide the latency introduced by the lazy execution model? At what point (e.g., in low-occupancy kernels) does the overhead of laziness start to outweigh its memory-saving benefits?
-
The paper contrasts lazy execution with eager approaches like prefetching. How do the authors envision LazyGPU interacting with a conventional hardware prefetcher? Would the prefetcher need to be disabled, or could the two mechanisms be made to work synergistically (e.g., by having the lazy unit inform the prefetcher)?
-
The instruction-aware optimization for eliminating dead memory requests is very compelling. Have the authors considered the potential for applying this optimization to a wider range of instructions beyond multiply-based ones (e.g., logical operations, shifts)? What is the estimated potential of such generalizations?
Review 3
Review Form: The Innovator
Summary
The authors propose LazyGPU, a GPU microarchitecture designed to mitigate memory contention by fundamentally changing when memory requests are issued. The core idea is to employ "lazy execution" for memory instructions, deferring the issuance of a load request from the decode/issue stage until the point where a subsequent instruction actually requires the data. The paper presents this core idea in three stages:
- LazyCore: A baseline implementation of lazy execution on a GPU, which reorders memory requests by prioritizing those that are blocking computation.
- LazyCore+①: The integration of lazy execution with a Zero Cache. The delay inherent in lazy execution provides a natural window to query the Zero Cache first, allowing memory requests to be eliminated non-speculatively if the required data is all-zero.
- LazyGPU (LazyCore+①②): An additional optimization that leverages the lazy execution window to inspect the consuming instruction. If the consumer is an instruction like multiplication and its other source operand is zero, the memory request for the value to be multiplied is eliminated as "dead."
The authors evaluate this architecture on a range of benchmarks, with a focus on sparse neural networks like ResNet-18 and LLaMA 7B, demonstrating significant speedups by reducing memory system pressure.
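To summarize the staged design in executable form, here is a minimal behavioral sketch of the check-then-issue flow; all structure and function names are hypothetical stand-ins rather than the paper's actual interfaces.

```python
# Minimal, illustrative model of the check-then-issue flow (LazyCore+①②).
# PendingLoad, the zero-line set, and issue_request are hypothetical stand-ins.
from dataclasses import dataclass

@dataclass
class PendingLoad:
    address: int
    dest_reg: int
    consumer_opcode: str   # e.g., "MUL", "FMA", "ADD"
    other_src_value: int   # value of the consumer's other source operand

def resolve_on_demand(load, zero_lines, issue_request):
    """Invoked only when a consumer instruction actually needs the loaded value."""
    # ①: the lazy delay allows a Zero Cache lookup *before* any request is issued.
    if load.address in zero_lines:
        return 0                           # load satisfied with zero; no memory traffic
    # ②: if the consumer is a multiply(-add) and its other operand is zero,
    # the loaded value is dead, so the request is eliminated as well.
    if load.consumer_opcode in ("MUL", "FMA") and load.other_src_value == 0:
        return 0                           # any value works; skip the request
    return issue_request(load.address)     # otherwise issue a normal memory request

# Toy usage: only the last load actually reaches the memory system.
memory, issued = {0x40: 7}, []
fetch = lambda addr: issued.append(addr) or memory.get(addr, 0)
resolve_on_demand(PendingLoad(0x10, 1, "ADD", 3), {0x10}, fetch)
resolve_on_demand(PendingLoad(0x20, 2, "MUL", 0), set(), fetch)
resolve_on_demand(PendingLoad(0x40, 3, "ADD", 3), set(), fetch)
assert issued == [0x40]
```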
Strengths
The primary strength of this work lies in the synergistic combination of pre-existing concepts. The authors correctly identify that the principal weakness of eager execution—issuing memory requests that may later prove unnecessary—can be addressed by the principal strength of lazy execution—delay. The most novel insight is using the delay window created by lazy execution to enable more effective, non-speculative application of other known optimization techniques (namely, Zero Caches and dead value elimination). This creates a powerful feedback loop: lazy execution enables better filtering of memory requests, which in turn reduces the memory contention that lazy execution was designed to mitigate in the first place.
Weaknesses
My analysis focuses exclusively on the novelty of the proposed ideas, measured against the body of prior art. While the combination of techniques is interesting, the novelty of the constituent parts is limited.
-
The Concept of Lazy Execution is Not New: The authors themselves cite "LaZy superscalar" [8] (Aşılıoğlu et al., ISCA 2015), which introduced this concept for CPU architectures. The claim of novelty in the present work, therefore, rests on the argument that its application to and implementation for GPUs is a novel contribution. The paper claims this is an "underexplored design scheme" for GPUs (Section 2, page 2), but does not sufficiently articulate the unique architectural challenges of the SIMT model that required a fundamentally new solution beyond what was proposed for CPUs.
-
Zero Caches are Not New: The concept of a cache that stores metadata about zero-value blocks is well-established. The authors cite the foundational works by Dusser et al. [26] (ICS 2009) and Islam and Stenstrom [36] (PACT 2009). In those works, the zero-check often runs in parallel to a main memory request, which is then cancelled. The "delta" here is that LazyGPU's delay makes this check-then-issue flow non-speculative. This is a clever integration, but it is an incremental refinement of how to use a Zero Cache, not a new concept in itself.
-
Instruction-Aware Elimination of Memory Requests is Conceptually Similar to Prior Work on Sparsity: The core idea of optimization ②—eliminating a load because it will be used in a multiply-by-zero operation—is a form of dynamic dead value identification. This is conceptually related to prior work on sparsity-aware processing. For example, "SAVE: Sparsity-aware vector engine" [29] (Gong et al., MICRO 2020) proposed a mechanism for CPUs to skip computation and memory accesses for operations involving zero-valued data by tracking data validity. While the mechanism in LazyGPU (tied to the lazy execution pipeline) is different from SAVE's, the high-level goal of exploiting zero-valued operands to eliminate work is identical. The paper needs to more clearly differentiate its contribution from this and other sparsity-aware execution paradigms.
The novelty of this paper is therefore not in any single primitive, but entirely in the specific integration of three known ideas. The significance of the contribution hinges on whether this integration is non-obvious and solves unique challenges specific to the GPU domain.
Questions to Address In Rebuttal
-
Regarding Novelty over SAVE [29]: The optimization to eliminate memory requests for operands of instructions like multiply-add when another operand is zero (optimization ②) appears functionally similar to the goals of SAVE. Please clarify the fundamental novelty of your approach. Is the primary contribution the non-speculative nature of the optimization, which is enabled by the lazy execution pipeline, thus avoiding the potential complexities of speculation-and-recovery mechanisms?
-
Regarding Novelty of Lazy Execution on GPUs: The foundational concept of lazy execution was proposed for CPUs in [8]. Beyond stating that this is "underexplored" for GPUs, please elaborate on the specific, novel microarchitectural contributions required to adapt this concept to a massively parallel SIMT architecture. For instance, what challenges arose in managing pending requests for an entire wavefront versus a single thread, and how does your design for storing request information in the physical register file (Figure 6, page 6) represent a novel solution to these challenges?
-
Regarding the Complexity/Benefit Trade-off: The claimed hardware overhead of 0.009% of the total die size (Section 5.5, page 13) seems exceptionally low, given that it requires adding state (busy bits) to a large physical register file and logic to store partial address information. Could you provide a more detailed breakdown of this cost, perhaps relative to the area of the SM or the register file itself, rather than the entire die? A small percentage of a large die can still be a significant absolute area, and a more contextualized figure is needed to properly evaluate the novelty of the implementation's efficiency.
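To illustrate why a contextualized figure matters, consider a back-of-the-envelope calculation with purely hypothetical numbers (a 600 mm² die and 20 mm² of aggregate physical register file; neither figure is from the paper):

$$0.009\% \times 600\,\mathrm{mm}^2 \approx 0.054\,\mathrm{mm}^2, \qquad \frac{0.054\,\mathrm{mm}^2}{20\,\mathrm{mm}^2} \approx 0.27\%.$$

Under these assumptions the same absolute area is roughly 30x larger when expressed against the register files than against the whole die, which is why the question asks for the overhead relative to the SM or register file rather than the entire die.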
Evaluating Ruche Networks: Physically Scalable, Cost-Effective, Bandwidth-Flexible NoCs
Abstract
2-D mesh has been widely used as an on-chip network topology, because of its low design complexity and physical scalability. However, its poor latency and throughput scaling have been well-noted in the past. Previous solutions to overcome its ...
Reviews
Review 1
Reviewer Persona: The Guardian (Adversarial Skeptic)
Summary
This paper presents an evaluation of Ruche Networks, an on-chip network topology that augments a standard 2-D mesh with uniform, long-range "express" links. The authors posit that this approach retains the physical design advantages of a mesh while overcoming its performance scaling limitations. Through RTL-level simulations with both synthetic and benchmark-driven traffic, the authors conclude that Ruche Networks are superior to conventional 2-D mesh and 2-D torus topologies in terms of performance, power, area, and cycle time. The core argument rests on the idea that adding physical express links is a more cost-effective method for improving performance than implementing virtual channels (VCs) as required by a torus.
Strengths
- The evaluation methodology is grounded in RTL-level implementations of the router microarchitectures, which provides a more credible basis for area, timing, and power analysis than high-level simulation models.
- The paper commendably uses a combination of synthetic traffic patterns (Section 4.1) and a comprehensive suite of execution-driven parallel benchmarks (Section 4.6), allowing for both micro-architectural stress testing and system-level performance characterization.
- The analysis in Section 4.5 and Table 4, which explicitly considers the relationship between bisection bandwidth, memory bandwidth, and network aspect ratio, is a valuable and clear-headed piece of analysis that helps frame the design space.
- The specific performance diagnosis for the half-torus on the Jacobi benchmark (Section 4.6), where nearest-neighbor communication becomes a worst-case scenario, is an insightful observation that lends credibility to the simulation framework.
Weaknesses
Despite its strengths, the paper's central claims of cost-effectiveness and superiority are built on a foundation of questionable assumptions and critical omissions in the analysis. My primary concerns are as follows:
-
Critically Flawed Area and Energy Accounting: The paper's primary argument is that Ruche is "cost-effective." However, the cost analysis is fundamentally flawed.
- The area comparison in Figure 7 and Table 2 is explicitly limited to the router logic. It completely ignores the physical area cost of the long-range Ruche channels themselves, including the significant area consumed by repeaters needed to drive these global wires. A network defined by its long-range links cannot have its area cost evaluated while ignoring the area of those links. This omission invalidates the "area efficiency" and "area-normalized speedup" claims (Table 6).
- Similarly, the initial energy analysis in Table 3 explicitly states: "This result does not include the energy dissipated by long-range links outside the tile area." This is a fatal omission. The very mechanism purported to provide benefit is excluded from the cost analysis. While a wire energy model is introduced later in Section 4.9, it appears to be an oversimplified first-order model whose results—that wire energy is a "very small percentage of the total energy" (Figure 13)—are deeply counter-intuitive for long, repeated global wires and require much stronger validation.
-
Unconvincing Baseline for 2-D Torus: The paper's negative characterization of the 2-D torus relies on a potentially uncharitable baseline implementation.
- The argument in Figure 3c that a VC router must "discard one of the mesh crossbars" is a specific design choice, not an inherent property of VC-based routers. An alternative implementation could have maintained the crossbar bandwidth. This choice seems designed to cripple the torus baseline from the outset.
- The reported saturation throughput for the 16x16 torus under uniform random traffic is only 19% (Figure 6). This figure is suspiciously low for a well-designed torus network and suggests that the baseline may be under-provisioned (e.g., insufficient VCs, suboptimal allocator design) and not representative of a state-of-the-art implementation. The claims of Ruche's superiority are weakened if the comparison is made against a strawman.
-
Bifurcated and Inadequately Justified Evaluation: The paper abruptly switches from evaluating "Full Ruche" with synthetic traffic (Section 4.1) to evaluating "Half Ruche" for the more realistic benchmark-driven analysis (Sections 4.5-4.9). The justification provided—that all-to-edge traffic only requires horizontal links—is insufficient.
- This split raises immediate suspicion. Why was the supposedly superior Full Ruche topology not carried through to the benchmark evaluation? One might infer that the full cost (in area, power, or routing complexity) of a Full Ruche network was too high to show a benefit in a more realistic setting, which would significantly weaken the paper's overall claims. The authors must demonstrate the performance of Full Ruche on the benchmark suite to present a complete and honest evaluation.
-
Oversimplified Physical Design Argument: The paper claims Ruche is "physically scalable" based on the regularity of its tile-based layout (Figure 2). This is a superficial argument that ignores the profound physical design challenges of implementing such a topology at scale. The paper fails to discuss or quantify the impact of increased routing congestion from adding numerous global wires, the difficulty of timing closure across these multi-tile links, or the potential for crosstalk and signal integrity issues. The simple repeater model in Section 4.9 is inadequate for addressing these first-order VLSI concerns.
Questions to Address In Rebuttal
-
Please provide a revised area analysis (akin to Figure 7) that includes the area of the repeaters required for all Ruche links and an estimate for the routing area overhead based on the number of additional wiring tracks consumed. How does this affect the "area-normalized speedup" metric in Table 6?
-
Please justify your torus router implementation. Specifically, why was the choice made to halve its crossbar bandwidth relative to a multi-mesh (as depicted in Figure 3), and can you provide evidence that its performance (e.g., 19% saturation in Figure 6) is representative of a competitive, modern torus design?
-
To provide a consistent evaluation, please present the benchmark speedup, latency, and energy results (Figures 10-13) for the Full Ruche topology. If its performance is not superior to Half Ruche, please explain the architectural reasons for this outcome.
-
The wire energy calculation in Section 4.9 appears to be a primary source of the paper's strong energy efficiency claims. Can you provide a more detailed breakdown of this model, including the assumptions made for repeater sizing, leakage power, and wire parameters (e.g., how was 0.2 pF/mm validated for your target 12nm process)? Please provide a sensitivity analysis showing how total energy changes if wire capacitance/energy-per-bit is 2x or 5x higher than your estimate.
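For reference, a first-order repeated-wire energy estimate of the kind referred to above, with the 0.2 pF/mm figure taken from the question and the 0.8 V supply and 0.5 activity factor being illustrative assumptions:

$$E_{\mathrm{bit}} \approx \alpha \, C_w \, L \, V_{DD}^{2} = 0.5 \times 0.2\,\tfrac{\mathrm{pF}}{\mathrm{mm}} \times L \times (0.8\,\mathrm{V})^{2} \approx 0.064\,\tfrac{\mathrm{pJ}}{\mathrm{bit \cdot mm}} \times L.$$

Because this estimate scales linearly in $C_w$ (and excludes repeater switching and leakage), a 2x or 5x error in capacitance or energy-per-bit translates directly into a 2x or 5x change in link energy, which is the point of the requested sensitivity analysis.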
Review 2
Paper Title: Evaluating Ruche Networks: Physically Scalable, Cost-Effective, Bandwidth-Flexible NoCs
Reviewer Persona: The Synthesizer (Contextual Analyst)
Summary
This paper presents a comprehensive and compelling evaluation of Ruche Networks, a Network-on-Chip (NoC) topology that augments the standard 2-D mesh with regular, physical, long-range "skip" links. The authors situate their work in the well-known context of the scalability limitations of 2-D mesh, particularly its bisection bandwidth bottleneck, which is increasingly problematic for modern data-intensive manycore architectures. The core contribution is not the invention of Ruche Networks, but rather the first rigorous, RTL-level, execution-driven evaluation that fills a critical gap left by prior analytical work.
Through detailed simulations using both synthetic traffic and a suite of parallel benchmarks, the authors systematically compare Ruche against 2-D mesh and 2-D folded torus—its most practical competitor. Their findings demonstrate that Ruche offers a superior design point, achieving higher throughput and lower latency at a reduced area and power cost compared to a virtual-channel-based torus. The work provides strong evidence that Ruche Networks represent a highly practical and effective solution for scaling on-chip interconnects, preserving the physical design advantages of mesh while overcoming its primary performance bottlenecks.
Strengths
-
High-Quality and Comprehensive Evaluation: The paper's primary strength is the depth and realism of its evaluation methodology. By moving beyond analytical models and high-level simulations to RTL-level implementations (for area, power, and timing analysis in Section 4.2, Page 5) and full-system, execution-driven simulation (Section 4.6, Page 8), the authors provide a level of evidence that is both convincing and highly valuable to the community. This rigorous approach gives significant weight to their claims of superiority over torus and mesh.
-
Excellent Contextualization and Problem Framing: The authors do an outstanding job of placing Ruche Networks within the broader landscape of NoC research. The introduction (Section 1, Page 1) astutely points out why historical solutions like concentration and simple channel widening are based on "outdated assumptions" that no longer apply to modern stream-based, data-intensive workloads. The comparison against folded torus is particularly insightful, as it represents the most direct and physically-plausible alternative for adding long-range links to a mesh-like structure. Table 1 (Page 3) provides a clear and useful taxonomy of topologies based on physical scalability criteria.
-
Focus on Practicality and Physical Design: A key theme of the paper is its grounding in the realities of modern chip design. The authors consistently emphasize that Ruche retains the regular, tileable structure that makes 2-D mesh so popular (Figure 2, Page 3). This focus on physical realizability is a critical differentiator from more esoteric topologies that may look good on paper but are impractical to route on a 2D die. The analysis of "depopulated" crossbars (Figure 5, Page 5) is an excellent example of a practical, cost-saving optimization.
-
Clear Demonstration of a Superior Design Space: The paper successfully makes the case that Ruche offers a better set of trade-offs than its competitors. It demonstrates that the architectural complexity of virtual channels required for a deadlock-free torus negates many of its theoretical bandwidth advantages (Figure 6, Page 6). Ruche, by contrast, achieves deadlock freedom through simple dimension-ordered routing while using its hardware resources more efficiently to provide higher crossbar bandwidth. The results presented in the energy analysis (Figure 13, Page 11) are particularly striking, showing that half-torus can actually consume more total energy than 2-D mesh due to router overhead, a pitfall that Ruche avoids.
Weaknesses
While this is a strong paper, there are areas where its context and claims could be further broadened and strengthened.
-
Limited Exploration of Routing Algorithms: The evaluation is exclusively based on Dimension-Ordered Routing (DOR). While DOR provides a simple and effective deadlock-free mechanism, it is unable to route around congestion. The addition of numerous long-range links in the Ruche topology seems to create a path diversity that is ripe for exploitation by adaptive routing algorithms. A discussion of how Ruche might perform with even a simple adaptive scheme would provide a more complete picture of its potential. Without this, it's unclear if the full capability of the added physical links is being realized under heavy, non-uniform traffic.
-
The Premise of Underutilized Wiring: The justification for adding Ruche links rests on the premise, cited from [27], that 2-D meshes typically underutilize available wiring tracks between tiles. While this is a plausible and well-established observation, the paper would be stronger if it provided some quantitative data from its own physical design flow to support this. For example, showing wiring congestion maps or utilization statistics for a baseline mesh versus a Ruche network (e.g., RF=3) would turn this premise from a cited fact into a demonstrated reality within the context of their own experiments. How close to the practical wiring limit does a high Ruche Factor push the design?
-
Narrow Quantitative Comparison to Other Express Topologies: The paper provides a good qualitative comparison to topologies like MECS and Flattened Butterfly in Section 3 (Page 3). However, the quantitative evaluation is limited to mesh and torus. While a full RTL-level comparison is likely out of scope, including even a high-level simulation-based comparison against a topology like MECS could help readers better situate Ruche's performance. Is Ruche's advantage due to its constant-radix routers, its specific link placement, or both? A broader comparison could clarify the specific sources of its efficiency.
Questions to Address In Rebuttal
-
The choice of DOR is practical, but could you comment on the potential of using adaptive routing with Ruche Networks? Given that Ruche routers are simpler and faster than VC-based routers, would adding the minimal logic for adaptivity (e.g., extra VCs for deadlock avoidance) still result in a more efficient design point than the highly complex allocators found in a baseline torus?
-
Your work is predicated on the availability of VLSI wiring resources to implement the long-range Ruche links. Can you provide any concrete data from your place-and-route experiments regarding wire track utilization or routing congestion? This would significantly bolster the argument that Ruche is not just performant but also physically non-disruptive to implement.
-
The comparison to 2-D torus is well-motivated and excellent. Could you elaborate on the decision not to include a quantitative performance comparison against other express-link topologies like MECS? Would the increasing router radix of MECS make it a non-starter from an area/timing perspective even in the 16x16 networks you evaluated?
Review 3
Paper Title: Evaluating Ruche Networks: Physically Scalable, Cost-Effective, Bandwidth-Flexible NoCs
Reviewer Persona: The Innovator (Novelty Specialist)
Summary
The paper presents a detailed evaluation of Ruche Networks, a 2D mesh topology augmented with regular, equidistant long-range physical links. The authors implement RTL-level routers for Ruche, 2D mesh, and 2D torus topologies, and perform a comparative analysis based on synthetic traffic and execution-driven simulations of parallel workloads. The study provides a characterization of Ruche Networks in terms of performance (latency, throughput), area, power, and cycle time, arguing that Ruche provides a superior trade-off compared to mesh and torus.
The core architectural concept, "Ruche Networks," was previously proposed by the same authors in [15] and [25]. The primary contribution of this work is therefore not the introduction of a new network topology, but rather its comprehensive, hardware-level characterization and comparison against established alternatives. The novel claims are centered on the experimental insights derived from this evaluation, such as the effectiveness of depopulated crossbars and the scalability benefits of the Ruche Factor.
Strengths
The strength of this paper lies in its rigorous and detailed evaluation methodology, which moves significantly beyond the analytical models presented in the authors' prior work. The provided RTL-level implementations, synthesis results for area and cycle time (Figure 7, Page 7), and power analysis (Table 3, Page 7) provide a concrete and valuable grounding for the architecture's claims. This level of detail is essential for transitioning an architectural concept from a theoretical proposal to a viable engineering solution. The comparison against a virtual-channel-based 2D torus is particularly useful, as it directly contrasts two distinct methods for achieving deadlock-free, long-range connectivity.
Weaknesses
From the perspective of conceptual novelty, the paper's contribution is limited.
-
Recycled Core Idea: The central architectural idea—the Ruche topology—is not new. It was introduced in the authors' previous publications, specifically [15] "Ruche Networks: Wire-Maximal, No-Fuss NoCs" (NOCS 2020) and [25] "Implementing Low-Diameter On-Chip Networks..." (NOCS 2020). This paper is explicitly positioned as an evaluation of a known entity, making it an incremental contribution rather than a foundational one.
-
Well-Established General Concept: The broader concept of augmenting a 2D mesh with physical express links or bypass channels is a well-explored area in the network-on-chip literature. Topologies like Flattened Butterfly [17], MECS [12], and various other hierarchical or express-link-based designs have long sought to reduce the diameter of mesh networks. While Ruche offers a specific, physically-aware implementation with its equidistant links and constant-radix routers, it exists within this established paradigm. The paper does not introduce a fundamentally new way of thinking about network topology.
-
Standard Microarchitectural Optimizations: The proposed "depopulated" router variant (Figure 5, Page 5) is a direct application of a standard design practice. Router crossbars are commonly optimized by removing paths that are illegal under the chosen routing algorithm (in this case, DOR). This is a well-known technique to reduce area and power and does not constitute a novel microarchitectural contribution. The paper's contribution here is merely the quantification of this standard technique in the context of Ruche.
In essence, the paper does an excellent job of evaluating an existing idea but presents little in the way of new conceptual frameworks, algorithms, or architectural primitives.
Questions to Address In Rebuttal
-
The authors explicitly state that this paper aims to fill the evaluation gap left by their prior work [15, 25]. Beyond demonstrating that Ruche performs well, what is the single most significant and surprising conceptual insight derived from this evaluation? That is, what fundamental trade-off or principle did this hardware-level study reveal that was not already predictable from the high-level concept?
-
The core idea of Ruche is the addition of regular, equidistant "skip" links. This is topologically similar to other regular graph structures, such as k-ary n-cubes with additional chords. Could the authors articulate the fundamental topological novelty of Ruche that distinguishes it from this broader class of networks, beyond the specific tile-based physical implementation methodology?
-
One of the key results is that a simple Ruche configuration (e.g., RF=2, depopulated) yields most of the performance gains. While a valuable engineering guideline, this outcome seems predictable: adding any low-cost bisection bandwidth to a bisection-limited mesh should yield significant returns, with diminishing gains thereafter. Can the authors argue why this result is a novel finding rather than an empirical confirmation of first-order network theory?
Garibaldi: A Pairwise Instruction-Data Management for Enhancing Shared Last-Level Cache Performance in Server Workloads
Abstract
Modern CPUs suffer from the frontend bottleneck because the instruction footprint of server workloads exceeds the private cache capacity. Prior works have examined the CPU components or private cache to improve the instruction hit rate. The large ...
Reviews
Review 1
Paper Title: Garibaldi: A Pairwise Instruction-Data Management for Enhancing Shared Last-Level Cache Performance in Server Workloads
Reviewer: The Guardian
Summary
This paper identifies the "instruction victim problem" in server workloads, where instruction cachelines are evicted from the shared Last-Level Cache (LLC) due to contention with data cachelines. The authors argue that this is detrimental because an instruction miss can stall the frontend, preventing access to even hot, already-cached data. They propose Garibaldi, a pairwise instruction-data management scheme for the LLC. The mechanism introduces a "pair table" to track the relationship between instruction lines and the data lines they access. The hotness of data accesses (determined by hit/miss status under a baseline replacement policy) is used to calculate a "miss cost" for the corresponding instruction. Instructions with a high miss cost are selectively protected from eviction using a query-based mechanism. Additionally, the scheme prefetches data associated with unprotected instructions upon an instruction miss. The evaluation, performed on a simulated 40-core system, claims significant performance improvements over state-of-the-art LLC management schemes like Mockingjay.
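A minimal behavioral sketch of the pair-table bookkeeping described above may help fix the mechanism; the field names, counter width, protection threshold, and the k=1 tracked-line limit are assumptions for clarity, not the paper's exact parameters.

```python
# Illustrative model of the described pair-table bookkeeping.
# Counter width, threshold, and k=1 tracking are assumed values, not the paper's.
PAIR_TABLE = {}          # instruction-line physical address -> entry
COUNTER_MAX = 7          # n-bit saturating "miss cost" (3 bits assumed)
PROTECT_THRESHOLD = 5    # assumed protection threshold

def record_data_access(inst_line_pa, data_line_pa, llc_hit):
    """Update the pair table when a data access is correlated back to its instruction line."""
    entry = PAIR_TABLE.setdefault(inst_line_pa, {"miss_cost": 0, "paired_data": []})
    # Data hotness feeds back into the instruction's miss cost: +1 on an LLC hit, -1 on a miss.
    if llc_hit:
        entry["miss_cost"] = min(COUNTER_MAX, entry["miss_cost"] + 1)
    else:
        entry["miss_cost"] = max(0, entry["miss_cost"] - 1)
    entry["paired_data"] = [data_line_pa]        # remember the most recent paired line (k = 1)

def should_protect(inst_line_pa):
    """Query-based protection: keep high-miss-cost instruction lines in the LLC."""
    return PAIR_TABLE.get(inst_line_pa, {"miss_cost": 0})["miss_cost"] >= PROTECT_THRESHOLD

def on_instruction_llc_miss(inst_line_pa, prefetch):
    """For unprotected instruction lines, prefetch their paired data on an instruction miss."""
    if not should_protect(inst_line_pa):
        for data_pa in PAIR_TABLE.get(inst_line_pa, {}).get("paired_data", []):
            prefetch(data_pa)

# Toy usage: repeated data hits raise the instruction's miss cost until it is protected.
for _ in range(5):
    record_data_access(0x1000, 0x2000, llc_hit=True)
assert should_protect(0x1000)
```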
Strengths
- Problem Motivation: The paper provides a compelling motivation. Figure 1 (page 1) effectively illustrates that the instruction fetch (ifetch) component of CPI is a significant bottleneck in server workloads, particularly in multi-core configurations, which is often overlooked in traditional LLC management schemes focused solely on data.
- Core Concept: The fundamental idea of creating a feedback loop from data access outcomes back to instruction management at the LLC level is logical. Decoupling instruction and data management, as is traditionally done, ignores their inherent dependency, and this work attempts to address that.
- Evaluation Scope: The authors have conducted an extensive evaluation, including 16 server workloads, comparisons against multiple state-of-the-art policies (DRRIP, Hawkeye, Mockingjay), and a battery of sensitivity studies.
Weaknesses
My analysis finds that while the motivation is sound, the proposed mechanism rests on several questionable assumptions and its claimed benefits are not as robustly supported as presented.
-
Gross Oversimplification of the "Miss Cost" Metric: The paper's central concept of "opportunity cost" is implemented as a simple n-bit saturating counter, incremented on a data hit and decremented on a data miss (Section 4.1, page 6). This is a fundamentally flawed proxy for cost. A data miss that goes to main memory incurs a penalty hundreds of cycles longer than an LLC hit. Treating these two events as symmetric opposites (+1/-1) fails to capture the non-linear and highly variable cost of memory accesses. An instruction leading to one catastrophic data miss is far more costly than one leading to several data hits. The mechanism, as described, cannot distinguish this and is therefore unlikely to accurately identify the most critical instructions.
-
Fragility and Questionable Scalability of the Pair Table: The core hardware structure, the pair table, appears to be both fragile and a potential scalability bottleneck.
- The performance of Garibaldi is critically dependent on a large pair table. The sensitivity study in Figure 14(c) (page 12) shows a steep performance decline as the table size is reduced from the 16K-entry default. This undermines the claim of a modest "0.6% of LLC capacity" overhead (page 10), as this overhead is not optional but essential for the mechanism to function.
- The tracking mechanism relies on a per-core "helper table" to map a PC to an instruction physical address (IL_PA) (Section 5.1, Figure 8, page 8). The paper is completely silent on the miss rate of this helper table and the performance consequences of a miss. If this lookup fails, the entire chain of correlating a subsequent data access back to its instruction line breaks down. This is a critical omission that calls the viability of the entire tracking scheme into question.
-
Unconvincing Experimental Comparisons and Analysis: The evaluation contains analyses that weaken, rather than strengthen, the paper's claims.
- The comparison against partitioning-based protection in Figure 14(d) (page 12) appears to be a strawman argument. The authors themselves state in Section 2.2 that way-partitioning is challenging in modern processors with high core counts and low associativity. To then implement a simplistic version and show it performs poorly does not validate Garibaldi; it merely confirms a known limitation of a naive approach. A more rigorous comparison against a state-of-the-art dynamic partitioning scheme is required.
- The analysis of the paired-data prefetcher is insufficient. Figure 14(a) (page 12) shows that performance peaks when tracking only one or two data lines (k=1, 2) and then degrades. The authors offer no substantive explanation for this degradation. Is it due to prefetch-induced cache pollution? What are the accuracy, coverage, and timeliness metrics of this prefetcher? Without this analysis, the prefetching component feels ad-hoc and its benefit is not rigorously established.
-
Incomplete Problem Diagnosis: The paper attributes the high ifetch CPI from Figure 1 primarily to LLC instruction misses without providing a rigorous breakdown. Frontend stalls are a complex phenomenon resulting from L1-i cache misses, L2 misses, branch mispredictions, and other pipeline hazards. The paper does not quantify what fraction of these stalls are uniquely attributable to LLC misses. Without this data, it is impossible to assess whether an LLC-level intervention like Garibaldi is the most effective solution, or if the observed problem is better addressed at the L1/L2 level or via improved branch prediction.
Questions to Address In Rebuttal
The authors must address the following points directly to demonstrate the technical soundness of their work:
-
On "Miss Cost": Please provide a quantitative justification for using a simple, symmetric +/-1 counter as a proxy for "miss cost." Show data comparing this metric's predictions of instruction criticality against a more realistic oracle based on actual stall cycles incurred by dependent data accesses.
-
On Helper Table Failures: What is the measured miss rate of the per-core helper tables in your simulations? Crucially, what is the defined behavior and performance impact when a helper table lookup fails during a data access, preventing the correlation to an instruction line?
-
On Prefetcher Performance: Please provide a detailed analysis explaining the performance degradation seen in Figure 14(a) for k > 2. This analysis must include metrics for prefetcher accuracy and the extent of cache pollution caused by the additional prefetches.
-
On Problem Attribution: Please provide a detailed breakdown of the ifetch stall cycles reported in Figure 1. What percentage of these stalls are caused by L1-i misses that hit in L2, L2 misses that hit in the LLC, and finally, LLC misses that go to memory? This data is essential to validate the premise that the LLC is the primary locus of the problem.
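As referenced in question 1, one illustrative formulation of such a stall-cycle oracle (the notation and latency ranges are ours, not the paper's):

$$\mathrm{cost}(I) \;=\; \sum_{d \,\in\, \mathrm{deps}(I)} \mathrm{stall}(d), \qquad \mathrm{stall}(d) \approx \begin{cases} t_{\mathrm{LLC}} \ (\text{tens of cycles}) & \text{if } d \text{ is serviced by the LLC},\\ t_{\mathrm{DRAM}} \ (\text{hundreds of cycles}) & \text{if } d \text{ misses to memory}. \end{cases}$$

Under such a weighting, a single DRAM-bound dependent access dominates several LLC hits, a distinction the symmetric +/-1 counter cannot express.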
Review 2
Reviewer: The Synthesizer (Contextual Analyst)
Summary
The authors address the increasingly critical problem of frontend stalls in server workloads, specifically targeting instruction misses in the shared Last-Level Cache (LLC). They observe that modern, data-centric LLC management policies overlook the high cost of an instruction miss, which stalls the pipeline even if the data it needs is already resident in the cache.
The central contribution is "Garibaldi," a novel management scheme that redefines the value of an instruction cache line not by its own reuse characteristics, but by the "opportunity cost" associated with the hot data it subsequently accesses. It achieves this through a "pairwise" mechanism that tracks instruction-data relationships, propagates data hotness (as determined by the underlying cache replacement policy) back to the corresponding instruction, and uses this information to selectively protect these high-cost instructions from eviction. The mechanism also includes a conservative prefetcher for paired data when an unprotected instruction misses. Evaluated on top of a state-of-the-art LLC policy (Mockingjay), Garibaldi demonstrates a significant 6.1% average performance improvement for server workloads, highlighting the efficacy of this new perspective.
Strengths
-
Novel and Insightful Problem Framing: The paper’s primary strength lies in its excellent conceptualization of the "instruction victim" problem at the LLC level. The authors compellingly argue that the cost of an instruction miss is intrinsically linked to the hotness of the data it gates. The CPI stack analysis in Figure 1 (page 1) and the reuse distance breakdown in Figure 3 (page 4) provide a strong empirical foundation for this claim, effectively demonstrating that a significant and growing bottleneck exists where the authors propose to intervene.
-
An Elegant Conceptual Bridge: This work beautifully synthesizes concepts from two often-disparate domains of computer architecture: frontend pipeline optimization and shared cache management. By bringing awareness of frontend stalls into the LLC's policy decisions, Garibaldi bridges a crucial gap. The analogy drawn between their "instruction victim" and the well-known "inclusion victim" problem (citing work like QBS [33] from which they draw inspiration for their query-based mechanism) is particularly sharp and helps to situate the contribution within a broader theoretical context.
-
Pragmatic and Orthogonal Design: The proposed mechanism is well-considered from an implementation standpoint. It cleverly piggybacks on the hit/miss signals from existing advanced replacement policies (like Mockingjay) to define data "hotness," making the design orthogonal and broadly applicable. The details, such as the Pair Table, the dynamic threshold adjustment via coloring (Section 5.2, page 8), and the decoupled storage for data addresses, demonstrate a thoughtful approach to balancing effectiveness with hardware overhead, which is analyzed and shown to be modest (Table 2, page 10).
-
Strong and Convincing Results: The performance improvements are not trivial. Achieving an additional 6.1% speedup on top of a powerful baseline like Mockingjay is a significant result in the context of LLC management. The per-workload analysis in Section 7.2 (page 11) effectively links performance gains to reductions in ifetch stall cycles, closing the loop and validating the paper's central hypothesis.
Weaknesses
While the core idea is strong, its presentation and exploration could be broadened to better understand its position in the design space.
-
Limited Scope of the "Pairing" Concept: The pairing mechanism, which links an instruction PC to the physical addresses of the data it accesses, is effective but may be a simplification of real-world program behavior. The paper does not deeply discuss how the mechanism copes with instructions involved in complex access patterns, such as pointer-chasing or accesses to highly dynamic data structures, where a single static instruction may touch a wide and changing set of data lines. The reliance on a small, fixed number of tracked data lines per instruction (k=1) is justified for latency but its implications for these more complex patterns are underexplored.
-
Under-examined Interplay with Advanced Prefetchers: The baseline system rightly includes advanced instruction (I-SPY) and data (GHB) prefetchers. However, the interplay between these components and Garibaldi is a rich area for discussion that is largely absent. For instance, a perfect instruction prefetcher would obviate the need for Garibaldi's protection mechanism entirely. It would be valuable to understand if Garibaldi is primarily compensating for prefetcher inaccuracies or if there is a more fundamental synergy. Could Garibaldi's pairwise information, for example, be used to create an even more powerful, data-aware instruction prefetcher?
-
Dependence on a Specific Workload Characteristic: The mechanism is predicated on the "many-to-few" access pattern (many cold instructions, few hot data) identified in server workloads. The paper correctly notes that in workloads like kafka, where both instruction and data reuse are low, the trade-off made by Garibaldi is not beneficial (Section 7.2, page 11). This suggests that the system's effectiveness is highly dependent on this characteristic. The work would be strengthened by a more formal characterization of the workload properties that define its ideal operating conditions and a discussion of potential dynamic mechanisms to modulate Garibaldi's aggressiveness in phases where its core assumptions do not hold.
Questions to Address In Rebuttal
-
The pairwise tracking currently maps an instruction line to a small, fixed number of data lines (k=1 in the final evaluation). Could the authors elaborate on how the scheme would handle instructions with more complex data access patterns (e.g., pointer chasing through a list, where data addresses are not stable)? Does the mechanism risk "polluting" its pairing information in such cases, and is the aging mechanism sufficient to manage this?
-
The baseline configuration includes an advanced instruction prefetcher (I-SPY). Could the authors discuss the interplay between Garibaldi and instruction prefetching in more detail? For example, does Garibaldi primarily rescue misses that the prefetcher fails to cover, or is there a more complex interaction? Could the pair table's data address information be used to create a more powerful, data-aware instruction prefetcher?
-
The analysis for the kafka workload (Section 7.2, page 11) is insightful, showing that the mechanism is not beneficial when both instruction and data reuse are low. Could the authors propose a dynamic mechanism or heuristic to detect such application phases and perhaps throttle or disable Garibaldi's protection policy to avoid degrading data cache performance in these scenarios? For example, could the P(D_miss | I_miss) metric used for threshold adjustment also serve this purpose?
Review 3
Review Form: The Innovator
Summary
The authors present "Garibaldi," a hardware mechanism designed to mitigate the "instruction victim problem" in the shared Last-Level Cache (LLC) for server workloads. They observe that cold instruction lines are often evicted to make space for hot data lines, leading to frontend stalls when those instructions are needed again, even if the data they would access is already resident in the LLC.
The core claim of novelty lies in a "pairwise instruction-data management" scheme. This scheme explicitly tracks the relationship between an instruction cache line and the data cache lines it subsequently accesses. The "hotness" of data accesses (as determined by whether they hit or miss in the LLC under a baseline policy) is used to compute a "miss cost" for the corresponding instruction line. This miss cost is stored in a dedicated "pair table" and serves two purposes: 1) to selectively protect high-cost instruction lines from eviction using a query-based mechanism, and 2) to trigger a prefetch of associated (cold) data lines when an unprotected instruction line misses.
Strengths
-
Novel Inversion of Information Flow: The central concept of propagating data cacheline hotness back to the instruction cacheline is a conceptually novel approach. The vast majority of prior art in PC-based cache management and prefetching uses the instruction's identity (the PC) to predict the future behavior of data (e.g., Hawkeye [32], Mockingjay [56]). This paper inverts this dependency, using the observed behavior of data to inform the cache management policy for the instruction. This represents a genuine, albeit subtle, shift in perspective.
-
Problem Formulation: The paper provides a clear and well-motivated articulation of the "instruction victim problem" at the LLC level. The insight that an instruction miss can be more costly than a data miss if the data is already resident is a sharp observation that effectively frames the need for a new solution.
-
Integrated Mechanism: The use of a single new structure, the pair table, to drive both selective protection and prefetching is an elegant design choice from a conceptual standpoint. It treats the instruction-data pair as a single entity to be managed holistically.
Weaknesses
-
Constituent Mechanisms are Not Novel: While the synthesis is new, the underlying building blocks are well-established.
- Query-Based Protection: The selective protection mechanism is an explicit adaptation of Query-Based Selection (QBS), which the authors correctly cite from Jaleel et al. [33]. The novelty is not in the mechanism itself, but in the new metric used to drive it (the "miss cost").
- PC-to-Data Association: Tracking associations between a PC and the data addresses it touches is the foundational principle of many advanced prefetchers and cache replacement policies (e.g., [32], [72]). The pair table is, in essence, a hardware structure to cache these associations, which is not a fundamentally new idea.
- Paired Prefetching: The idea of prefetching a data line upon an access to a related (instruction) line is a form of group or correlated prefetching. The novelty here is slim, resting on the trigger (an instruction miss) and the source of the correlation (the dynamically populated pair table).
-
High Implementation Complexity for the Achieved Gain: The proposed hardware is far from trivial. As detailed in Table 2 (page 10), Garibaldi requires approximately 194KB of SRAM storage for a 40-core configuration. This includes a 120KB main pair table, 40KB of per-core helper tables, and a 32KB D_PPN table. This represents a significant area and power overhead. The LLC controller logic is also made more complex, requiring new lookup paths and dynamic threshold management. The geometric mean speedup of 6.1% over a very strong baseline (Mockingjay) is commendable, but it is debatable whether this performance gain justifies the substantial new hardware complexity. The novelty of the idea is tempered by the brute-force nature of its implementation.
-
The "Delta" Over Prior Art is Primarily in Application, Not Invention: The core contribution is the creation of the "miss cost" metric and its application to existing management techniques. This is an engineering innovation rather than a foundational one. The paper does not propose a new caching theory or a fundamentally new architectural paradigm, but rather a new heuristic to plug into existing ones. For a mechanism of this complexity, a more significant conceptual leap would be expected.
Questions to Address In Rebuttal
-
The novelty rests on using data hit/miss status to manage instruction lines. How is this fundamentally different from a feedback-directed approach where the LLC replacement policy simply learns that PCs leading to data hits are "high value" and should be prioritized? Could a simpler scheme that just increments a priority counter for an instruction line's PC upon a subsequent data hit achieve a similar outcome with less hardware?
-
The paper proposes a large, centralized pair table that tracks physical addresses. This requires a complex helper table structure to map a data access's PC (virtual address) to its instruction line's physical address (IL_PA), as shown in Figure 8 (page 8). This seems to introduce a multi-step, latency-intensive process into the LLC controller. Please clarify the latency and critical path implications of this lookup process required on every data access that updates the pair table.
-
Given the high cost of the proposed structure, have the authors considered a more lightweight approach? For example, could the instruction-data pairing information be stored directly within the LLC tags/metadata for resident instruction lines, avoiding the need for a large, decoupled pair table? This would limit tracking to cached instructions but would dramatically reduce the area overhead. What is the justification for the large, off-path table structure versus a more integrated, albeit less comprehensive, approach?
NetCrafter: Tailoring Network Traffic for Non-Uniform Bandwidth Multi-GPU Systems
Abstract
Multiple Graphics Processing Units (GPUs) are being integrated into systems to meet the computing demands of emerging workloads. To continuously support more GPUs in a system, it is important to connect them efficiently and effectively. To this end, ...
Reviews
Review 1
Paper Title: NetCrafter: Tailoring Network Traffic for Non-Uniform Bandwidth Multi-GPU Systems
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present NetCrafter, a system comprising three techniques—Stitching, Trimming, and Sequencing—designed to optimize network traffic in multi-GPU systems with non-uniform interconnect bandwidths. The motivation stems from the observation that inter-cluster links are significant bottlenecks. The proposed techniques aim to improve bandwidth utilization by combining partially-filled flits (Stitching), reduce total traffic by only sending requested portions of cache lines (Trimming), and prioritize latency-critical page table walk traffic (Sequencing). The evaluation, conducted using the MGPUSim simulator, claims an average speedup of 16% over a baseline non-uniform configuration.
While the problem is relevant, this paper's central claims rest on a series of questionable assumptions and methodological choices. The analysis of performance trade-offs is superficial, particularly regarding the Trimming mechanism's impact on spatial locality and the latency implications of Flit Pooling. Furthermore, the work's applicability is severely limited by its simplistic memory and coherency model, which ignores the complexities of modern hardware-coherent systems.
Strengths
- Well-Motivated Problem: The paper correctly identifies that non-uniform interconnect bandwidth in emerging multi-GPU systems (e.g., Frontier) is a critical performance bottleneck.
- Systematic Observations: Section 3 provides a clear, data-driven analysis that forms the basis for the three proposed techniques. The identification of unused bytes in flits, partial cache line utilization, and the critical nature of PTW traffic are logically presented.
- Component-level Sensitivity: The authors perform some sensitivity analysis, for instance, on the Flit Pooling delay (Section 5.4, Figures 18 and 19), which demonstrates an attempt to justify design parameter choices.
Weaknesses
-
Fundamentally Flawed Rationale for Trimming: The paper’s core defense of Trimming rests on a weak and unsubstantiated claim. The authors state that since Trimming is only applied to inter-cluster requests, it "does not entirely negate the spatial locality benefits of natural fetching of the cacheline" (Section 4.3, pg. 7). This is a hand-wavy dismissal of a critical performance principle. Spatial locality is valuable regardless of whether the data resides on a local or remote cluster. By truncating cache line transfers to 16 bytes, the authors are explicitly gambling that no other data in that line will be needed, destroying any potential prefetching benefit. The evaluation in Section 5.3 (Figure 16) compares against a strawman "all-trimming" sector cache baseline. A proper evaluation would quantify the performance lost due to spoiled spatial locality on inter-cluster requests and compare against more intelligent hardware prefetching mechanisms that would be negatively impacted by this design.
-
Unjustified Latency Cost of Flit Pooling: The Stitching mechanism requires a companion technique, Flit Pooling, which intentionally stalls flits in a queue for up to 32 cycles hoping a merge candidate appears. The authors mitigate the obvious performance risk by creating "Selective Flit Pooling," which exempts PTW-related flits (Section 4.2, pg. 7). This solution is overly simplistic. It assumes that PTW requests are the only form of latency-critical traffic in the system. What about other synchronization primitives, atomic operations, or critical metadata reads that are not part of a PTW? These would be unduly delayed, and the paper provides no analysis of the latency impact on the broader distribution of non-PTW network packets. The claim that this delay is acceptable is not sufficiently supported.
-
Grossly Simplified Coherency and Memory Model: The paper's evaluation and design assume a software-managed coherence model and explicitly state that "remote L2 data is not cached in the local L2 partition" (Section 2.1, pg. 3). This is a profound simplification that invalidates the work's applicability to many state-of-the-art and research systems that employ hardware coherence. Hardware coherence protocols (e.g., MESI variants) would generate a high volume of small, latency-sensitive control packets (e.g., invalidations, writebacks, acknowledgments). The paper completely ignores this entire class of network traffic. How would these critical control packets interact with Flit Pooling? Would an invalidation message be delayed by 32 cycles? The claim in Section 4.5 (pg. 9) that NetCrafter "can also seamlessly complement any underlying hardware coherence mechanisms" is pure speculation and is not backed by a single piece of evidence or analysis.
-
Baseline Configuration Exaggerates Benefits: The primary evaluation is performed on a system with an 8:1 bandwidth ratio (128 GB/s intra-cluster vs. a meager 16 GB/s inter-cluster). While such asymmetry exists, this extreme configuration provides a near-perfect environment for traffic reduction techniques to show benefit. The performance gains are likely inflated by this choice. The sensitivity study in Figure 22 shows, as expected, that the benefits shrink as the ratio becomes less skewed. The headline 16% average speedup is therefore conditional on a highly bottlenecked network that may not be representative of all designs.
Questions to Address In Rebuttal
-
On Trimming and Spatial Locality: Please provide a quantitative analysis of the L1 cache miss penalty and overall performance impact resulting from the destruction of spatial locality for inter-cluster requests. How does your selective Trimming compare against a baseline that has a standard prefetcher active for all requests, including inter-cluster ones?
-
On Flit Pooling Latency: The "Selective" aspect of Flit Pooling only protects PTW traffic. What is the empirical evidence to support the assumption that no other traffic is latency-sensitive? Please provide data showing the latency distribution of non-PTW packets with and without the 32-cycle pooling delay.
-
On Coherency Model: The claim of compatibility with hardware coherence is unsubstantiated. Please articulate precisely how NetCrafter's mechanisms (Stitching, Pooling, and Sequencing) would classify and handle the control and coherence traffic (e.g., invalidations, upgrades, acknowledgments) generated by a directory-based coherence protocol. Would these critical messages be delayed by Flit Pooling?
-
On Performance Attribution: In Figure 14, the final results bar combines the effects of "Stitching + Trimming + Sequencing." To properly assess the contributions, please provide a performance breakdown that shows the incremental speedup of each technique individually over the baseline (i.e., Baseline vs. Baseline+Stitching, Baseline+Stitching vs. Baseline+Stitching+Trimming, etc.). This is essential for understanding which components of NetCrafter are providing the claimed benefit.
Review 2
Paper: NetCrafter: Tailoring Network Traffic for Non-Uniform Bandwidth Multi-GPU Systems
Review Form: The Synthesizer (Contextual Analyst)
Summary
This paper addresses the performance bottlenecks arising from non-uniform interconnect bandwidth in modern hierarchical multi-GPU systems, a design pattern exemplified by HPC systems like Frontier. The authors identify that the lower-bandwidth links connecting clusters of GPUs are a primary source of contention and performance degradation.
The core contribution is NetCrafter, a suite of three complementary, flit-level network traffic management techniques designed to improve the efficiency of these constrained links. The techniques are:
1. Stitching: Combines partially filled flits destined for the same location to reduce padding/overhead and improve link utilization.
2. Trimming: Fetches only the necessary bytes of a cache line (e.g., 16B instead of 64B) over the slow links, reducing total data transfer volume for requests with low spatial locality.
3. Sequencing: Prioritizes latency-critical network traffic, specifically Page Table Walk (PTW) related packets, over bulk data transfers to prevent head-of-line blocking on critical operations.
Through simulation on a Frontier-like multi-GPU model, the authors demonstrate that NetCrafter achieves an average performance improvement of 16% and up to 64% across a diverse set of applications.
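To make the interplay of the three techniques concrete, the following is a minimal, purely illustrative model of an inter-cluster egress queue; the flit and sector sizes, the pooling window, and all field names are assumptions for clarity rather than NetCrafter's actual implementation.

```python
# Illustrative model of Stitching, Trimming, and Sequencing at an inter-cluster
# egress queue. Sizes (64B flit, 16B sector, 32-cycle pool) and field names are
# assumptions for clarity, not NetCrafter's actual parameters.
from dataclasses import dataclass

FLIT_BYTES, SECTOR_BYTES, POOL_CYCLES = 64, 16, 32

@dataclass
class Flit:
    dest_gpu: int
    payload_bytes: int
    is_ptw: bool = False   # page-table-walk traffic is treated as latency-critical
    wait: int = 0          # cycles already spent pooling

def trim(requested_bytes: int) -> int:
    """Trimming: carry only the requested sector(s) over the slow link, not the full line."""
    sectors = -(-requested_bytes // SECTOR_BYTES)        # ceiling division
    return min(FLIT_BYTES, sectors * SECTOR_BYTES)

def dispatch(queue):
    """Sequencing + Stitching: PTW flits bypass pooling and go first; partial non-PTW
    flits pool briefly and are merged with a same-destination partner when possible."""
    queue.sort(key=lambda f: not f.is_ptw)               # PTW traffic is prioritized
    sent, pooling = [], {}
    for f in queue:
        if f.is_ptw or f.payload_bytes == FLIT_BYTES or f.wait >= POOL_CYCLES:
            sent.append(f)                               # latency-critical, full, or timed out
        elif f.dest_gpu in pooling and pooling[f.dest_gpu].payload_bytes + f.payload_bytes <= FLIT_BYTES:
            pooling[f.dest_gpu].payload_bytes += f.payload_bytes   # stitched into one flit
            if pooling[f.dest_gpu].payload_bytes == FLIT_BYTES:
                sent.append(pooling.pop(f.dest_gpu))     # a filled flit leaves immediately
        else:
            pooling[f.dest_gpu] = f                      # wait for a stitching partner
    return sent, list(pooling.values())                  # pooled flits stay queued

# Toy usage: a 20B request is trimmed to 32B, and two 32B flits to GPU 1 are stitched.
assert trim(20) == 32
sent, pooled = dispatch([Flit(1, 32), Flit(2, 16, is_ptw=True), Flit(1, 32)])
assert [f.is_ptw for f in sent] == [True, False] and not pooled
```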
Strengths
The primary strength of this work lies in its pragmatic and well-motivated approach to a real, timely, and increasingly critical problem.
-
Excellent Problem Contextualization: The paper is grounded in the clear architectural trend of building large-scale GPU complexes using hierarchical interconnects. This isn't a purely academic exercise; it's a direct response to the design challenges faced by current and future exascale systems. The motivation is clear, compelling, and well-supported by examples from industry.
-
Synthesis of Proven Concepts: The work's elegance comes not from a single radical invention, but from the insightful synthesis and application of established networking and architecture principles to the specific domain of multi-GPU systems.
- Trimming is a clever, dynamic, and link-aware application of the core idea behind sectored caches. The decision to only trim on slow inter-cluster links (as discussed on page 7, Section 4.3) is particularly insightful, as it mitigates bandwidth pressure where it matters most while preserving the prefetching benefits of full cache line transfers on high-bandwidth local links.
- Stitching is a flit-level analogue to message coalescing or TCP piggybacking, effectively tackling the classic problem of fragmentation and padding overhead. The addition of "Flit Pooling" to increase stitching opportunities demonstrates a thoughtful design process.
- Sequencing is a direct application of Quality of Service (QoS) principles. The identification of PTW traffic as the most latency-critical component (Observation 3, page 5) is a key insight that allows for a simple yet highly effective prioritization scheme.
-
Strong, Data-Driven Motivation: Each of the three techniques is justified by a clear observation backed by data presented in the Motivation and Analysis section (Section 3, pages 4-5). This foundational analysis (e.g., Figures 6 and 7 showing underutilized flits and cache lines) makes the subsequent design choices feel logical and well-founded, rather than arbitrary.
-
Thorough and Rigorous Evaluation: The experimental methodology is robust. The authors use a respected simulator (MGPUSim), a relevant system configuration, and a diverse set of workloads. The sensitivity studies on Flit Pooling delay (Section 5.4), flit size, and especially the varying bandwidth ratios (Figure 22, page 12) are crucial for establishing the generality and robustness of the proposed solution. The direct comparison of their Trimming approach against a standard sectored cache baseline (Figure 16, page 11) is a particularly strong piece of analysis that validates their nuanced design.
Weaknesses
The weaknesses of the paper are minor and relate more to its positioning and potential future scope rather than fundamental flaws in the core idea.
-
Incremental Nature of Individual Components: While the synthesis is novel, the constituent ideas are conceptually related to prior work in broader fields. Trimming relates to sectored caches, Stitching to packet packing, and Sequencing to QoS. The paper would be slightly stronger if it more explicitly framed its contribution as the novel adaptation and co-design of these principles for the unique traffic patterns of non-uniform multi-GPU interconnects.
-
Understated Hardware Complexity: The paper argues for low overhead, citing ~16KB of SRAM. While the storage overhead is indeed small, the logical complexity added to the network switch is non-trivial. The NetCrafter controller (Figure 13, page 8) requires queue parsing, candidate searching for stitching, timers for pooling, and prioritization logic. This could potentially impact the switch's pipeline depth and critical path latency, a factor that is abstracted away by the fixed 30-cycle latency assumption.
-
Limited Scope Regarding Coherence: The work assumes a software-managed coherence model, which is common today. However, the field is moving towards hardware-coherent multi-GPU systems. Such systems introduce new classes of small, latency-critical traffic (e.g., invalidations, acknowledgments, probes). It is a missed opportunity to not discuss how NetCrafter's mechanisms could be extended to manage this coherence traffic, which would be a perfect candidate for both Stitching and Sequencing.
Questions to Address In Rebuttal
-
The evaluation assumes the NetCrafter logic fits within the baseline 30-cycle switch latency. Can the authors provide more reasoning on the feasibility of this? Specifically, how does the search for a stitching candidate within the Flit Pooling mechanism avoid extending the critical path of the switch pipeline?
-
The work is situated within a software-coherent memory model. How do the authors envision NetCrafter adapting to a future hardware-coherent multi-GPU system? Would coherence messages be treated as a new, high-priority traffic class for Sequencing, and would they be good candidates for Stitching with other control packets?
-
Could the authors comment on the potential for negative interactions between the mechanisms? For instance, does the Trimming mechanism, by creating smaller response packets, reduce the opportunity for the Stitching mechanism to find "parent flits" with enough empty space to be useful? Or do they primarily act on different types of traffic, minimizing interference?
Review 3
Review Form: The Innovator
Summary
The paper proposes NetCrafter, a system designed to optimize network traffic in multi-GPU systems characterized by non-uniform interconnect bandwidth. The authors identify that traffic on the slower, inter-cluster links is a primary performance bottleneck. To address this, NetCrafter employs a combination of three techniques: 1) Stitching, which combines partially filled flits to improve link utilization; 2) Trimming, which fetches only the necessary portions of a cache line (sub-blocks) for requests traversing the slow links; and 3) Sequencing, which prioritizes latency-critical page table walk (PTW) traffic over bulk data traffic. The authors claim this combination of techniques is novel and results in significant performance improvements, averaging 16% across their evaluated workloads.
Strengths
The paper addresses a timely and practical problem. As multi-GPU systems scale using hierarchical and non-uniform networks (e.g., Frontier, Aurora), managing the traffic on lower-bandwidth links is of paramount importance. The authors correctly identify key sources of network inefficiency. The evaluation appears thorough, covering a diverse set of applications and performing sensitivity studies.
Weaknesses
My primary concern, and the focus of this review, is the fundamental novelty of the proposed techniques. While the authors present NetCrafter as a combination of "novel approaches," the core concepts underlying each of the three pillars—Stitching, Trimming, and Sequencing—are well-established principles in computer architecture and networking.
-
Stitching is functionally equivalent to packet/message coalescing. The idea of aggregating smaller data units into a larger transmission unit to amortize header overhead and improve link utilization is not new.
- The paper's own Related Work section (Section 6, Page 12) acknowledges similar concepts like TCP piggybacking [79, 88] and batching in NICs [41, 80].
- More directly, Finepack [62] proposed dynamic coalescing of small writes/stores in a multi-GPU context. The authors of NetCrafter claim their approach is more general by operating at the flit level for various packet types. However, this appears to be an incremental extension of the same core concept rather than a fundamentally new idea.
- The proposed "Flit Pooling" mechanism, which delays a flit to find a stitching partner, is conceptually identical to techniques like interrupt coalescing/moderation in network cards [67, 89], where an event is delayed to allow for batch processing.
-
Trimming is a direct application of sectored/sub-blocked caches. The mechanism of fetching only a portion of a cache line is a classic architectural technique, known as sub-blocking or sectoring, designed to reduce memory bandwidth consumption at the cost of more fine-grained metadata.
- The authors acknowledge the concept of sectored caches [35, 74] in Section 4.3 (Page 7).
- The only "novel" aspect presented is the policy of applying this technique selectively, i.e., only for requests traversing the slower inter-GPU-cluster links. While this is a sensible engineering decision to balance bandwidth savings against potential harm to spatial locality, a new application policy for an old mechanism does not, in my view, constitute a novel technical contribution for a premier architecture conference.
-
Sequencing is a standard application of Quality of Service (QoS). Prioritizing latency-sensitive control or metadata traffic over bulk data traffic is a foundational concept in network design, often implemented using virtual channels or priority queues.
- The insight that PTW traffic is latency-critical is also well-documented in prior work on GPU virtual memory, as noted by the authors' own citations [7, 28, 44, 81] in the Introduction (Section 1, Page 1).
- Combining a known problem (PTW latency) with a standard solution (traffic prioritization) is sound engineering but lacks research novelty.
In summary, the paper's contribution appears to be the system-level integration and specific application of three pre-existing optimization principles to the multi-GPU interconnect problem. While the engineering and evaluation of this integrated system are valuable, the work does not introduce a fundamentally new architectural concept. The performance gains of 16% on average, while respectable, are not so transformative as to justify the claims of novelty for what is essentially a clever recombination of known techniques.
Questions to Address In Rebuttal
The authors should use the rebuttal to convince the program committee of the work's novelty by addressing the following points directly:
-
On Stitching: Beyond applying the idea to more packet types, what is the fundamental conceptual difference between flit-level Stitching and prior art in message/packet coalescing (e.g., Finepack [62])? How is "Flit Pooling" conceptually novel compared to decades of batching/delay mechanisms used in networking hardware to improve efficiency?
-
On Trimming: The underlying mechanism for Trimming is a sectored cache fetch. Do the authors contend that the novelty lies exclusively in the policy of when to apply it? If so, please argue why a new application policy for a decades-old mechanism is a significant enough contribution for ISCA.
-
On the Combination: The primary argument for novelty may rest on the synergistic combination of these three techniques. Please elaborate on any non-obvious, emergent benefits that arise from combining these three specific techniques that would not be achieved by implementing them independently. Are there specific interactions between Stitching, Trimming, and Sequencing that create a "whole is greater than the sum of its parts" effect?
ANSMET: Approximate Nearest Neighbor Search with Near-Memory Processing and Hybrid Early Termination
Abstract
Approximate nearest neighbor search (ANNS) is a fundamental operation in modern vector databases to efficiently retrieve nearby vectors to a given query. On general-purpose computing platforms, ANNS is found not only to be highly memory-bound due to the ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present ANSMET, a DIMM-based near-data processing (NDP) system for accelerating Approximate Nearest Neighbor Search (ANNS). The core proposal is a "hybrid partial-dimension/bit early termination" (ET) algorithm, which aims to reduce memory traffic by calculating a lower-bound distance on partially fetched vectors and terminating further accesses if this bound exceeds a threshold. This ET scheme is supported by a sampling-based offline process to optimize the data layout for fetch efficiency. While the paper addresses a relevant problem, the central claims rest on a heuristic methodology whose robustness is not sufficiently established. Furthermore, the headline claim of "no accuracy loss" is conditional and potentially misleading, obscuring significant performance and storage overheads required to maintain it.
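As a reading aid, here is a minimal sketch of the dimension-level half of the proposed early termination, assuming squared-L2 distance and an illustrative chunk size; it is not the authors' code, and the bit-level variant would further tighten the bound using only the most significant bits of each fetched element.

```python
import numpy as np

def early_terminate_l2(query: np.ndarray, vec: np.ndarray,
                       threshold_sq: float, chunk: int = 16) -> bool:
    """Illustrative partial-dimension early termination (assumed parameters).

    The running sum of squared differences over the dimensions fetched so far
    is a monotonically growing lower bound on the full squared L2 distance,
    so the vector can be rejected as soon as the partial sum exceeds the
    current threshold (e.g., the k-th best distance found so far)."""
    partial = 0.0
    for start in range(0, len(query), chunk):
        d = query[start:start + chunk] - vec[start:start + chunk]
        partial += float(np.dot(d, d))
        if partial > threshold_sq:        # reject without fetching the rest
            return True
    return False                          # survived: full distance == partial
```

In ANSMET this check would run inside the NDP unit, so the remaining 64-byte chunks of a rejected vector are never fetched from DRAM.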
Strengths
- The paper correctly identifies that a large fraction of distance computations in ANNS are for "rejected" vectors, providing a clear motivation for an early termination strategy.
- The evaluation is conducted across a reasonable set of seven diverse, large-scale datasets, which provides breadth to the experimental results.
- The proposed hardware architecture is a pragmatic extension of existing DIMM-based NDP concepts, and the breakdown of system components is clearly described.
Weaknesses
-
The "No Accuracy Loss" Claim is Misleading. The central premise of the early termination algorithm is that it guarantees no accuracy loss. However, a deep reading of Section 4.2 ("Offline common prefix elimination") and Table 5 reveals this is not strictly true under all configurations.
- To achieve no accuracy loss for outlier vectors (those not matching the common prefix), the system must store the original, non-compressed vector separately and perform a "re-check." This incurs both storage overhead and, more critically, additional memory accesses (Table 5a shows up to 1.4% extra accesses for the default 0.1% outlier threshold). This performance penalty is not adequately factored into the main performance results and undermines the efficiency claims.
- The alternative, which is implied to be more efficient, involves dropping the least significant bits of outlier elements (Figure 4c), which by the authors' own admission leads to significant accuracy loss (Table 5b shows a catastrophic -34.7% accuracy drop). The paper cannot simultaneously claim "no accuracy loss" as a general feature while also presenting and depending on configurations that either lose accuracy or incur non-trivial overhead to preserve it.
-
The Sampling-Based Data Layout Optimization is Heuristic and Lacks Rigor. The entire data layout, which is critical to the performance of the hybrid ET, is determined by an offline sampling of just 100 vectors from billion-scale datasets (Section 4.2). This methodology is highly suspect.
- The choice of parameters (nc, nF, TC) is entirely dependent on this minuscule sample being perfectly representative of both the global data distribution and, implicitly, the future query distribution. There is no analysis to demonstrate that this holds. What is the performance degradation if the sample is not representative?
- The use of the 90th percentile distance from the sample as the reference threshold is arbitrary. Figure 11b shows this percentile yields the lowest KL divergence for the DEEP dataset, but there is no theoretical justification or evidence that this specific value is optimal or even suitable for other datasets. The methodology appears to be curve-fitting on a small sample.
-
The Absolute Improvement in Fetch Utilization is Underwhelming. Figure 10 is presented as a success, showing that ANSMET (NDP-ETOpt) improves effectual data fetch utilization to 11.1% from a baseline of 6.0%. While this is a relative improvement, an 11.1% utilization rate is still exceptionally low. This suggests that even with the proposed complex optimizations, nearly 90% of all data fetched from DRAM remains useless. This fundamentally questions the overall efficacy of the approach; it makes an extremely inefficient process slightly less inefficient, rather than solving the core problem.
-
Baseline Comparisons May Not Be Robust. The paper compares against NDP-DimET as its representation of prior art. However, it is not clear whether this baseline is an optimized implementation of dimension-level ET or a strawman. State-of-the-art dimension-level ET techniques can be more sophisticated than what is implied here. Without a more detailed specification of this baseline, the claimed 1.52x speedup of the hybrid ET is not verifiable.
Questions to Address In Rebuttal
-
Please clarify the "no accuracy loss" claim. For the mode that guarantees accuracy, provide a rigorous analysis of the performance and storage overhead of the re-check mechanism across all datasets, not just SPACEV. The main results in Figure 6 should be updated to reflect this overhead.
-
Justify the sampling methodology. Provide a sensitivity analysis showing how the final system performance (QPS) changes with (a) the number of vectors in the sample (e.g., 10, 100, 1000, 10000) and (b) the chosen distance threshold percentile (e.g., 70%, 80%, 90%, 95%). This is critical to prove the method is robust and not just a fragile heuristic.
-
Provide a more detailed specification of the NDP-DimET baseline. What specific algorithm from prior work ([25, 69, 86]) does it implement? How was it optimized to ensure a fair comparison against your proposed hybrid technique?
-
The final fetch utilization rate is still very low (~11%). Can the authors comment on the fundamental reasons for this remaining inefficiency? Does this suggest a ceiling on the effectiveness of any early termination strategy that operates at this level of granularity?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents ANSMET, a hardware-software co-design that accelerates Approximate Nearest Neighbor Search (ANNS) by tackling its primary bottlenecks: memory bandwidth and wasted computation. The core contribution is the synthesis of two key ideas: (1) a practical DIMM-based Near-Data Processing (NDP) architecture to offload distance calculations, and (2) a novel, lossless "hybrid early termination" algorithm. This algorithm computes a conservative lower bound of the distance to a query vector by incrementally fetching and processing partial data—either a subset of dimensions or, more uniquely, a subset of the most significant bits within each dimension. This enables the system to terminate work on unpromising vectors much earlier, often before the full data has even been fetched from DRAM. The authors support this co-design with a systematic, sampling-based methodology for optimizing the in-memory data layout to maximize the effectiveness of early termination. The work is evaluated in a cycle-accurate simulation environment, demonstrating significant speedups on billion-scale datasets.
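To illustrate the bit-level side of the hybrid bound, the following sketch shows the per-element reasoning under the assumption of unsigned 8-bit fixed-point elements (the element encoding and names are illustrative). Summing such per-element bounds over the fetched dimensions yields a conservative lower bound on the distance, which is what allows termination before the remaining bits are ever read from DRAM.

```python
def elem_lower_bound(q: int, msb_prefix: int, known_bits: int,
                     total_bits: int = 8) -> int:
    """Lower bound on |q - x| when only the top `known_bits` of x are known.

    Illustrative arithmetic only (assumed unsigned fixed-point elements);
    the real design operates on whatever element encoding the layout uses."""
    shift = total_bits - known_bits
    lo = msb_prefix << shift              # smallest value consistent with prefix
    hi = lo + (1 << shift) - 1            # largest value consistent with prefix
    if q < lo:
        return lo - q
    if q > hi:
        return q - hi
    return 0                              # q may equal x; no rejection power

# Example: q = 200 and only the top 4 bits of x are known to be 0b0011
# (so x is in [48, 63]); then |q - x| >= 200 - 63 = 137, contributing
# 137**2 to the squared-L2 lower bound.
```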
Strengths
The true strength of this paper lies in its holistic, problem-driven approach and its excellent positioning within the broader landscape of systems and algorithm research.
-
Elegant Problem-Solution Coupling: The authors correctly identify the fundamental inefficiency in ANNS workloads: a large fraction of memory accesses and computations are for vectors that are ultimately rejected (as shown clearly in Figure 1, page 3). Their proposed hybrid early termination is not just a clever trick; it is an elegant solution directly tailored to this specific problem. By stopping data movement and computation at the earliest possible moment, it addresses the inefficiency at its root.
-
A Novel Synthesis of Algorithmic Primitives: The concept of early termination is not new, nor is the idea of bit-serial processing. However, the authors' contribution is to create a powerful synthesis. They combine dimension-level termination (a known technique) with a novel bit-level termination, creating a hybrid approach. More importantly, they build the necessary system infrastructure around it, particularly the data layout optimization (Section 4.2, page 5), which transforms the raw algorithm into a practical systems solution. The use of prefix entropy to guide a dual-granularity fetch strategy is particularly insightful.
-
Pragmatic Hardware Choices and Contextual Awareness: The choice of a DIMM-based NDP architecture, as opposed to a more exotic 3D-stacked memory approach, is a sign of mature systems thinking. The authors understand that billion-scale vector databases require capacity above all else, a domain where DIMMs excel. They correctly place their work in conversation with the rich body of literature on DIMM-based NDP for recommendation systems (e.g., RecNMP, TensorDIMM), noting the architectural similarities but highlighting the unique algorithmic opportunities that ANNS presents.
-
Connecting Algorithm Theory to Hardware Reality: This work serves as an excellent case study in how abstract algorithms can be re-imagined in light of concrete hardware characteristics. The 64-byte granularity of a DRAM access is no longer just a parameter to be tolerated, but a constraint to be exploited. The entire data layout optimization scheme is built around making every 64-byte transfer as informative as possible for the early termination heuristic. This is a wonderful example of true hardware-software co-design.
Weaknesses
The weaknesses identified are less about fundamental flaws and more about the scope and positioning of the work, which could be expanded to clarify its broader applicability.
-
Focus on a Single Indexing Family: The evaluation and discussion are heavily centered on HNSW, a graph-based index. While HNSW is state-of-the-art and highly representative, ANNS is a diverse field. It would strengthen the paper to discuss in more detail how the ANSMET approach would apply to other major index families, such as inverted file (IVF) based methods. The access patterns are different (more sequential scans within inverted lists), which might change the dynamics of early termination and load balancing across NDP units.
-
Implications for Dynamic Databases: The proposed data layout transformation is an offline preprocessing step. This is perfectly reasonable for static datasets. However, a key feature of modern vector databases is support for incremental insertions and updates. A discussion on the practicality and overhead of applying these transformations in a dynamic environment would be valuable. Does a new vector need to be fully transformed before it's searchable? How does this impact ingestion latency?
-
Under-explored Interaction with Vector Quantization: The paper briefly mentions compatibility with vector quantization (Section 4.3, page 7), but this interaction is critical. Product Quantization (PQ) and its variants are the industry standard for reducing the memory footprint of billion-scale indexes. Quantization fundamentally alters the bit-level representation and error profile of the data. It is not immediately clear how much benefit the bit-level termination provides on top of data that has already been compressed to 8-bit integers. A more detailed analysis or evaluation of this interplay would significantly increase the paper's practical impact.
Questions to Address In Rebuttal
-
Beyond HNSW, could the authors elaborate on the expected performance and potential challenges of applying ANSMET to an IVF-based index? Specifically, how would the more structured memory access patterns of scanning an inverted list interact with the proposed hybrid partitioning and adaptive polling schemes?
-
Regarding the offline data layout optimization: What is the authors' vision for integrating this approach into a dynamic vector database that handles frequent insertions? Would the overhead of the bit-level transformations on newly inserted vectors negate some of the search-time benefits?
-
Could the authors provide more intuition or preliminary data on the effectiveness of hybrid early termination when applied to vectors that have already undergone Product Quantization (PQ)? Does the common prefix elimination technique still yield significant benefits on the quantized codes, or does the benefit shift primarily to the partial dimension aspect of the technique?
-
The current design cleanly separates index traversal (CPU) from distance comparison (NDP). This is a sensible division of labor. Looking forward, do the authors see any potential in offloading parts of the index traversal itself, such as fetching the neighbor lists in HNSW, to the NDP units to further reduce CPU-memory traffic?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper, "ANSMET," presents a hardware-software co-design to accelerate Approximate Nearest Neighbor Search (ANNS). The authors propose integrating a DIMM-based Near-Data Processing (NDP) architecture with a novel early termination (ET) strategy. The core claim of novelty resides in this ET strategy, which is a hybrid of partial-dimension and partial-bit evaluation. By fetching and processing a vector piece-by-piece (in terms of both dimensions and the bits within each dimension’s representation), the system computes a conservative lower bound on the distance. If this bound exceeds the current threshold, subsequent memory accesses and computations for that vector are terminated, saving bandwidth and compute cycles without accuracy loss. This core idea is supported by a sampling-based offline method to optimize data layouts for the ET strategy and a heterogeneous execution model where the host CPU handles index traversal and the NDP units handle distance calculations.
Strengths
-
Genuinely Novel Algorithmic Contribution: The central contribution—the hybrid partial-dimension/bit early termination—is a novel concept in the ANNS acceleration space. Prior work on ET has largely focused on either the vector level [52] or the dimension level [25, 69]. While bit-level ET has been explored in BitNN [32] for low-dimensional point clouds, that work was confined to bit-serial processing. ANSMET proposes a more generalized and flexible framework that combines dimension- and bit-level termination and, crucially, operates on variable-sized chunks of bits rather than serially. The "dual-granularity fetch" mechanism detailed in Section 4.2 (page 6) is a direct and novel outcome of this core idea.
-
Strong Co-Design Principle: The paper successfully demonstrates a tight coupling between the novel algorithm and the system design. The algorithmic need for non-contiguous, prioritized data access (e.g., most significant bits first) directly motivates the non-trivial data re-layout scheme. The proposed sampling-based offline optimization to determine the fetch granularity (nc, nF) and common prefix length is a clever and novel engineering solution that makes the core algorithmic idea practical. This full-stack thinking, from algorithm to data layout to hardware, is a significant strength.
-
Novel Application of Existing Concepts: The use of "offline common prefix elimination" (Section 4.2, page 6) is a creative application of a well-known compression technique to the problem of ANNS. Identifying and omitting low-entropy prefixes in vector elements to reduce data movement is a novel approach in this specific context.
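As an illustration of the prefix-elimination idea, here is a small sketch under the assumption of unsigned 8-bit elements; the outlier handling and the exact on-chip metadata format described in the paper are simplified away.

```python
import numpy as np

def common_prefix_bits(col: np.ndarray, total_bits: int = 8) -> tuple[int, int]:
    """Find the longest high-order bit prefix shared by all values in one
    element position across the corpus (col is an integer array of that
    position over all vectors).  Illustrative only; the paper's offline
    pass additionally tolerates a small fraction of outliers."""
    for b in range(total_bits, 0, -1):
        prefixes = col >> (total_bits - b)
        if np.all(prefixes == prefixes[0]):
            return b, int(prefixes[0])    # (prefix length, prefix value)
    return 0, 0

def strip_prefix(col: np.ndarray, bits: int, total_bits: int = 8) -> np.ndarray:
    """Store only the low-order residual bits; the shared prefix is kept once
    as metadata and re-attached when an element is reconstructed."""
    mask = (1 << (total_bits - bits)) - 1
    return col & mask
```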
Weaknesses
-
Architectural Foundation is Not Fundamentally New: While the paper claims to be the first DIMM-based NDP system for ANNS, the underlying hardware architecture is an adaptation of prior art. DIMM-based NDP systems with buffer-chip logic for accelerating high-dimensional vector operations are well-established in the context of recommendation systems [42, 47, 82]. The hardware described in Section 5 (page 7), including the QSHRs and parallel compute units, is conceptually very similar to these previous designs. The novelty is therefore not in the base architecture, but in its specific tailoring and co-design for the proposed ET algorithm. The paper should be more explicit about this distinction.
-
Complexity Introduced by Offline Analysis: The entire system's effectiveness hinges on a complex offline preprocessing and data transformation stage. This stage requires statistical analysis of a data sample (e.g., prefix entropy, early termination frequency as shown in Figure 3, page 5) to determine optimal layout parameters. While the authors argue the cost is amortized, this dependency introduces a significant novelty-vs-practicality tradeoff. For highly dynamic vector databases with frequent insertions and updates, the novelty of this static optimization may become a practical bottleneck, as the learned layout could become stale.
-
The "Delta" Over Prior Art May Be Data-Dependent: The performance gains from the most novel components (dual-granularity fetch and prefix elimination) appear to be highly dependent on the statistical properties of the dataset. For instance, if a dataset's vector elements have high-entropy prefixes, the benefit of common prefix elimination would be negligible. Similarly, if the early termination probability is not concentrated in a specific bit-range, the dual-granularity fetch may offer little advantage over a simpler, uniform-chunk approach. The novelty is sound, but its significance might not be universal across all possible high-dimensional vector distributions.
Questions to Address In Rebuttal
-
Regarding the hardware architecture, can the authors precisely delineate what aspects of their NDP unit design (Figure 5, page 7) are fundamentally novel, beyond the parameterization required for the hybrid ET algorithm? How does the required logic differ in principle from what was proposed in NDP systems for recommendation models like RecNMP [42] or TensorDIMM [47]?
-
The novelty of the system relies heavily on offline analysis and a static data layout. How does the proposed ANSMET framework handle dynamic vector databases where new vectors are continuously inserted? Would this require periodic, system-wide re-analysis and data re-shuffling, potentially negating the online performance benefits?
-
The core idea of prioritizing more significant bits is intuitive. However, could the authors comment on the robustness of their sampling-based parameter tuning? What is the performance degradation if the sampling set is not representative of the true query distribution, leading to suboptimal choices for the prefix length or the coarse/fine-grained bit steps (nc, nF)? This would help assess the true "delta" of this novel contribution in real-world, unpredictable scenarios.
DReX: Accurate and Scalable Dense Retrieval Acceleration via Algorithmic-Hardware Codesign
Abstract
Retrieval-augmented generation (RAG) supplements large language models (LLMs) with information retrieval to ensure up-to-date, accurate, factually grounded, and contextually relevant outputs. RAG implementations often employ dense retrieval methods and ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper proposes DReX, an algorithmic-hardware co-design for accelerating dense vector retrieval, primarily for Retrieval-Augmented Generation (RAG) applications. The core idea is a two-stage process: 1) an in-DRAM filtering mechanism called Sign Concordance Filtering (SCF) that uses simple logic (PIM Filtering Units or PFUs) to prune the search space by comparing only the sign bits of query and corpus vectors, and 2) a near-memory accelerator (NMA) that performs an exact nearest neighbor search on the much smaller, filtered set of vectors. The system is architected as a CXL Type-3 device using LPDDR5X memory. The authors claim that DReX is dataset-agnostic, accurate, and significantly outperforms state-of-the-art ANNS methods on both CPU and GPU platforms.
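For concreteness, the filtering primitive as described reduces to sign extraction, XOR, and popcount; the NumPy sketch below models it on the host side (the cutoff parameter and the packed-bitmap layout are assumptions of this sketch, not the paper's exact format).

```python
import numpy as np

def pack_signs(vectors: np.ndarray) -> np.ndarray:
    """Pack the sign bit of every dimension into a bit vector (1 = negative).
    Illustrative host-side model of what the in-DRAM layout would hold."""
    return np.packbits((vectors < 0).astype(np.uint8), axis=-1)

def sign_disagreement(query_bits: np.ndarray, corpus_bits: np.ndarray) -> np.ndarray:
    """Number of dimensions whose signs disagree, per corpus vector.
    XOR plus popcount is all the in-DRAM filtering units would need."""
    diff = np.bitwise_xor(corpus_bits, query_bits)        # broadcasts over rows
    return np.unpackbits(diff, axis=-1).sum(axis=-1)      # popcount per row

def filter_candidates(query: np.ndarray, corpus: np.ndarray, cutoff: int) -> np.ndarray:
    """Keep only vectors whose sign-disagreement count is below the cutoff."""
    qb = pack_signs(query[None, :])
    cb = pack_signs(corpus)
    return np.nonzero(sign_disagreement(qb, cb) <= cutoff)[0]
```

In DReX this comparison would run inside the PFUs, so only the surviving candidates are streamed to the NMAs for the exact search.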
Strengths
- Clear Co-design Philosophy: The work presents a clear and compelling case for an algorithmic-hardware co-design approach. The tight integration of the SCF algorithm with the in-DRAM PFU logic and the specialized data layout (Section 5.2, page 6) is a well-considered piece of engineering.
- Detailed Architecture: The proposed hardware architecture is detailed and appears plausible. The choice of LPDDR5X over HBM is well-justified based on capacity and shoreline pin limitations (Section 5.4, page 7). The distribution of NMAs per memory package is a sound design choice for scalability.
- Inclusion of Ablation Study: The ablation study in Section 7.2 (page 12) is valuable. It effectively isolates the performance contributions of the near-memory exact search component (N/A → NMAs) versus the full DReX system with in-memory filtering (PFUs → NMAs), providing a clearer picture of where the speedups originate.
Weaknesses
-
Fundamental Flaw in the "Dataset-Agnostic" Claim: The central claim that DReX is "dataset-agnostic" (Abstract, page 1) is not only unsubstantiated but is directly contradicted by the authors' own analysis. The entire premise of Sign Concordance Filtering relies on the assumption that vector distributions are centered around zero, making similarity correlatable with sign agreement. The authors admit this reliance in Section 3 (page 4), stating "many of these embedding vectors demonstrate distributions... centered on or near zero." This is a strong assumption, not a general property. More damningly, the discussion in Section 8 and Figure 18 (page 14) explicitly demonstrates the catastrophic failure of SCF on a "pathologically constructed dataset" (i.e., a non-negative dataset), where filtering performance becomes worse than random. The proposed solution, Iterative Quantization (ITQ), is an admission that the core SCF algorithm is, in fact, highly dataset-dependent and requires a preprocessing step to enforce the necessary data properties. This preprocessing overhead is not evaluated, and its necessity fundamentally undermines the paper's core premise of generality.
-
Misleading Performance Comparisons: The performance comparisons in Figure 11 (page 10) are presented as an algorithm-to-algorithm showdown, but they are fundamentally a platform-to-platform comparison. The DReX system is a custom-designed accelerator with an enormous internal memory bandwidth of 1.1 TB/s (Table 2, page 9). It is compared against general-purpose CPUs with 282 GB/s and GPUs with 3.35 TB/s of memory bandwidth. While the GPU has higher peak bandwidth, HNSW and other graph-based methods exhibit irregular access patterns that fail to saturate it, whereas DReX's design is tailored for sequential streaming. The massive speedups reported (e.g., 270x over CPU IVF-SQ) are therefore more indicative of the benefits of specialized high-bandwidth hardware for brute-force computation than the superiority of SCF over ANNS. A fair comparison would require acknowledging that the baselines are severely bandwidth-bottlenecked on their respective platforms.
-
Insufficiently Rigorous Competitor Evaluation: The comparison against the ANNA accelerator (Section 7.1.3, page 11) is based on a "first-order model" constructed by the authors (Section 6, page 9). Comparing a detailed simulation of a proposed architecture against a high-level analytical model of a competing architecture is not a rigorous or convincing evaluation. This approach is susceptible to modeling errors and optimistic assumptions that could unfairly favor the authors' proposal.
-
Questionable Power and Area Modeling: The power analysis in Section 7.4 (page 13) relies on applying a power breakdown model from an HBM paper (Lee et al. [40]) to their LPDDR5X-based system. The authors must provide justification for why power characteristics of these two very different memory technologies can be considered analogous. Furthermore, the PFU area overhead is calculated based on a synthesis in a 16nm logic process and then scaled, with an assumed 10x area penalty for a DRAM process (Section 6, page 8). This is a rough estimation, and the actual implementation costs of integrating non-trivial logic into a cutting-edge DRAM periphery could be substantially higher. The reported "modest" overheads rest on these fragile assumptions.
Questions to Address In Rebuttal
- Please reconcile the claim of your method being "dataset-agnostic" with the evidence presented in Section 8 that SCF fails completely on non-zero-centered data. If the solution is to use ITQ, please provide a full evaluation of the computational and storage overhead of this mandatory preprocessing step and incorporate it into your end-to-end performance results.
- Can you justify the fairness of comparing your specialized, high-bandwidth hardware platform against general-purpose CPUs and GPUs? Please provide an analysis that decouples the gains from the SCF algorithm itself versus the gains from having a massive, dedicated memory bandwidth for the exact search phase. For instance, what is the performance if the baseline ANNS algorithms were run on a hypothetical platform with equivalent memory bandwidth to DReX?
- Please defend the decision to compare DReX against a "first-order model" of the ANNA accelerator rather than a more rigorous, published simulation framework or implementation. What steps were taken to validate that your model accurately represents the performance and bottlenecks of the ANNA architecture?
- Provide a more robust justification for your power and area modeling. Specifically, why is it valid to use an HBM power breakdown for an LPDDR5X system? What evidence supports the 0.1 mm² area for a PFU implemented in a real DRAM process, beyond the high-level 10x penalty assumption?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper addresses a critical and timely bottleneck in modern AI systems: the performance of dense vector retrieval for Retrieval-Augmented Generation (RAG). The authors correctly identify the problematic trade-off between slow but accurate Exact Nearest Neighbor Search (ENNS) and fast but often inaccurate and dataset-specific Approximate Nearest Neighbor Search (ANNS).
The core contribution is DReX, a compelling algorithmic-hardware co-design that aims to deliver the accuracy of ENNS with performance surpassing ANNS. The proposal is built on two key ideas:
1. An elegant and computationally inexpensive algorithm called Sign Concordance Filtering (SCF), which uses the sign bits of vector dimensions to perform a high-throughput, online filtering of a vector database.
2. A hierarchical hardware architecture that implements this co-design, featuring in-DRAM PIM Filtering Units (PFUs) to execute SCF with massive parallelism, and near-memory accelerators (NMAs) to perform an exact similarity search on the small, filtered set of candidate vectors.
The authors present a holistic system design, including specific DRAM data layouts and a CXL-based integration strategy. Their evaluation, culminating in a 6.2-7x reduction in time-to-first-token for a representative RAG application, convincingly demonstrates the system's potential.
Strengths
-
Excellent Problem-Solution Fit: The paper targets a high-impact, real-world problem. The retrieval step is a well-known performance and quality limiter for RAG. The proposed solution is not merely an acceleration of an existing algorithm but a ground-up rethinking of the problem from a co-design perspective, which is precisely the right approach.
-
Elegance of the Core Algorithm: The Sign Concordance Filtering (SCF) method (Section 4, page 4) is the paper's conceptual jewel. It is simple, intuitive, and, most importantly, exceptionally well-suited for hardware implementation (requiring little more than bitwise XORs and popcounts). This avoids the complexity of trying to implement something like graph traversal (HNSW) in hardware and instead creates an algorithm that thrives on the vast, simple parallelism available in DRAM.
-
Holistic and Credible System Design: This is a strong systems paper. The authors have considered the full stack, from the algorithm down to the data layout in memory (Section 5.2, page 6), the PIM logic in the DRAM periphery (Section 5.3, page 6), the near-memory compute unit (Section 5.4, page 7), and the system integration via CXL. The justification for using LPDDR5X over HBM is well-reasoned and adds to the design's credibility. This end-to-end thinking is a significant strength.
-
Connecting Architectural Gains to Application-Level Impact: A major strength of the evaluation is the direct line drawn from the retrieval throughput improvements (Figure 11, page 10) to the reduction in application-level time-to-first-token (TTFT) (Section 7.3, Figure 15, page 12). This is often missing in architecture papers, which can get lost in micro-benchmarks. By showing a tangible benefit to the end-user of an LLM application, the authors make a powerful case for their work's significance.
-
Anticipation of Limitations: The discussion in Section 8 (page 14) proactively addresses the most obvious critique of SCF: its dependency on data distributions being centered around zero. By showing that a standard technique like Iterative Quantization (ITQ) can effectively mitigate this pathological case, the authors substantially strengthen their claims of generality and robustness.
Weaknesses
While this is an excellent paper, its positioning within the broader landscape of in-memory processing could be strengthened.
-
Context within the Broader History of In-Memory Filtering: The architectural pattern of using simple in-memory logic for coarse-grained filtering followed by more powerful near-memory processing for fine-grained evaluation is a classic idea in the database accelerator and processing-in-memory communities. While the application to dense vectors for RAG is novel and the specific SCF algorithm is new, the paper would benefit from acknowledging this lineage. Placing DReX as the latest and most sophisticated application of this long-standing pattern would strengthen its academic context rather than weaken its novelty.
-
Limited Exploration of the Algorithmic Design Space for In-DRAM Filtering: SCF is presented as the primary solution, and it is a very good one. However, the paper could be even more impactful by briefly discussing why SCF is the right choice compared to other potential hardware-friendly, online filters. For instance, were other simple primitives considered, such as filtering based on a few key quantized dimensions, or a simple form of Locality-Sensitive Hashing (LSH)? A short discussion justifying the choice of SCF over these alternatives would add depth.
-
The Economic and Practical Viability of Modified DRAM: The proposed system relies on custom logic within the DRAM die (the PFUs). This is a notoriously high bar for adoption in the industry. While the power and area analysis (Section 7.4, page 13) is good, the paper could benefit from a paragraph discussing the path to adoption. Is this a feature that could be standardized by JEDEC? Could it be a high-margin custom product for a specific hyperscaler? Acknowledging this practical hurdle and suggesting a path forward would make the work more complete.
Questions to Address In Rebuttal
-
The success of this work is predicated on the elegance of the SCF algorithm. Did the authors consider or prototype other simple, hardware-amenable online filtering techniques (e.g., based on quantized centroids, or a subset of vector dimensions)? A brief discussion on why SCF was chosen over potential alternatives would be enlightening.
-
Can the authors further contextualize their contribution in relation to the broader history of in-memory/near-data filtering for search and database applications? While the application to RAG is new, the architectural pattern feels familiar. Acknowledging this and highlighting what makes the DReX design uniquely suited for high-dimensional vectors would strengthen the paper's positioning.
-
The proposed system requires significant modifications to the memory subsystem (custom DRAM and NMAs). Could the authors comment on the cost-benefit trade-off from a Total Cost of Ownership (TCO) perspective? How does the cost of a DReX system compare to a baseline system that achieves similar performance by simply scaling out more commodity servers with GPUs running an optimized ANNS library?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents DReX, an algorithmic-hardware co-design for accelerating dense vector retrieval, primarily for Retrieval-Augmented Generation (RAG) applications. The authors identify the core novelty as a two-stage process:
1) A computationally lightweight, online filtering algorithm called Sign Concordance Filtering (SCF), implemented using in-DRAM Processing-in-Memory (PIM) logic (PFUs). This stage prunes the vast majority of non-candidate vectors without moving them off-chip.
2) An exact nearest neighbor search (ENNS) performed on the much smaller, filtered set of vectors using a near-memory accelerator (NMA).
The central novel claim is not the algorithm or the hardware in isolation, but their tight co-design. Specifically, the identification of an algorithm (SCF) that is simple enough for efficient PIM implementation yet effective enough to drastically reduce the search space for a subsequent, more complex near-memory processing stage.
Strengths
-
Novelty of the Co-Design: The primary strength of this work is the symbiotic relationship between the proposed algorithm and hardware. While PIM and near-memory accelerators for search are not new concepts, the authors have identified a particularly elegant primitive (Sign Concordance Filtering) that is exceptionally well-suited for in-DRAM implementation (bitwise XOR and accumulation). This avoids the complexity of implementing more sophisticated indexing or hashing schemes in memory. This specific co-design appears to be novel.
-
Online vs. Offline Filtering: The proposed SCF is an online filter, distinguishing it from the dominant paradigm of offline index construction used in ANNS methods like HNSW and IVF. This is a significant conceptual departure. By avoiding a rigid, pre-computed index structure, DReX offers flexibility for dynamic datasets where vectors are frequently added or updated, a point briefly touched upon in Section 8 (page 14). This online nature is a key part of its novelty.
-
Clear Delta from Prior Art: The authors build upon existing work, including what appears to be their own (IKS [61]), which focused on near-memory ENNS acceleration. The novelty of DReX is clearly articulated as the addition of the in-DRAM SCF pre-filtering stage (the PFUs). The paper effectively demonstrates that this new component is responsible for the majority of the performance gain over a pure near-memory ENNS accelerator. This represents a significant and non-obvious extension of prior art.
Weaknesses
-
Algorithmic Proximity to LSH: The core SCF mechanism, while presented as a novel heuristic, bears a strong conceptual resemblance to certain variants of Locality Sensitive Hashing (LSH), particularly SimHash, where the sign of dot products with random vectors forms a hash. SCF uses the signs of the vector's own components, which can be viewed as dot products with axis-aligned basis vectors. While functionally distinct (no random projections), the underlying principle of using sign agreement as a proxy for angular similarity is a well-established concept. The paper would be stronger if it more directly confronted this similarity and provided a clearer analysis of why its direct, axis-aligned approach is superior to a PIM-implemented LSH scheme for this specific problem. (A sketch contrasting the two signatures follows this list.)
-
Limited Novelty of the NMA Component: The paper acknowledges in Section 7.1 (page 10) that the ENNS-only configuration is "equivalent to IKS [61]". This implies the near-memory accelerator (NMA) architecture for the second stage is not, in itself, a novel contribution of this work. The novelty is therefore confined to the PFU filtering logic and the system-level pipeline connecting the PFU and NMA. This should be stated more explicitly upfront in the architectural description (Section 5) to precisely frame the paper's contribution.
-
Generality of the Core Mechanism: The effectiveness of SCF is predicated on the assumption that vector embeddings are distributed somewhat symmetrically around zero. The authors rightly identify this limitation in Section 8 (page 14) and show a pathological case where performance collapses. While they propose a known technique (ITQ) as a remedy, this highlights that the core novel mechanism is not universally applicable without data pre-processing. The novelty is therefore in a mechanism that works exceptionally well for a specific, albeit common, data distribution.
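To make the LSH comparison in the first weakness concrete, the sketch below contrasts the two signatures; the random hyperplanes are the standard SimHash construction, and the dimensions and seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def scf_signature(x: np.ndarray) -> np.ndarray:
    """DReX-style signature: signs of the vector's own components
    (equivalently, signs of dot products with axis-aligned basis vectors)."""
    return (x < 0).astype(np.uint8)

def simhash_signature(x: np.ndarray, planes: np.ndarray) -> np.ndarray:
    """SimHash-style signature: signs of dot products with random hyperplanes."""
    return (planes @ x < 0).astype(np.uint8)

d, n_bits = 128, 128
planes = rng.standard_normal((n_bits, d))   # random projections (LSH)
x = rng.standard_normal(d)

# Both reduce a vector to one bit per projection and use Hamming distance as a
# proxy for angular similarity; SCF simply fixes the projections to the axes,
# which is what makes it cheap enough to evaluate inside DRAM.
print(scf_signature(x).sum(), simhash_signature(x, planes).sum())
```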
Questions to Address In Rebuttal
-
Please clarify the novelty of Sign Concordance Filtering (SCF) with respect to established LSH techniques. Could you provide a brief theoretical or empirical comparison against a PIM-amenable LSH scheme (e.g., a single-table SimHash)? What are the specific trade-offs in terms of hardware complexity, filtering efficacy, and recall that make SCF a superior choice for this co-design?
-
The paper discusses a pathological case (Figure 18, page 14) requiring ITQ pre-processing. In a dynamic RAG environment where new documents are constantly added, what is the overhead of applying or updating an ITQ rotation to the corpus? Does this requirement for a global data transformation compromise the "simple update" story that is a key advantage of DReX over ANNS methods?
-
Could the authors be more precise about the novel contributions within the Near-Memory Accelerator (NMA) architecture itself? Beyond leveraging the design from IKS [61], are there any specific modifications or optimizations made to the NMA to better handle the sparse and temporally unpredictable stream of candidate vectors produced by the PFU stage? Or is the novelty purely in the addition of the PIM front-end?
EOD: Enabling Low Latency GNN Inference via Near-Memory Concatenate Aggregation
Abstract
As online services based on graph databases increasingly integrate with machine learning, serving low-latency Graph Neural Network (GNN) inference for individual requests has become a critical challenge. Real-time GNN inference services operate in an ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper presents EOD, a co-designed hardware/software system for low-latency Graph Neural Network (GNN) inference in an inductive setting. The core idea is to mitigate the latency caused by neighborhood explosion and data preparation by precomputing hidden features for all training nodes. To manage the resulting memory overhead, the authors propose a "multi-layer concatenate" compression scheme (ZVC). These algorithmic changes are supported by a custom DIMM-based Near-Memory Processing (NMP) architecture designed for efficient aggregation of the precomputed, compressed features.
While the paper identifies a valid and important problem, the proposed solution rests on a foundation of precomputation that introduces critical methodological flaws. The evaluation framework compares fundamentally different amounts of online work, leading to inflated performance claims, and fails to account for the substantial hidden costs required to maintain the system's accuracy, thereby undermining its claimed real-world viability.
Strengths
-
Clear Problem Identification: The paper correctly identifies that the Preparation and Memcpy steps, rather than pure computation, are the primary bottlenecks for real-time GNN inference services. The latency breakdown in Figure 3(a) effectively motivates the need to address these stages.
Novel Compression Heuristic: The observation of an inverse relationship in sparsity between consecutive GNN layers (Figure 5) and the proposed multi-layer concatenation method to exploit this for ZVC compression (Section 4.2) is a clever algorithmic insight. This technique effectively increases the minimum number of zero-values per node, enhancing the efficacy of the chosen compression scheme.
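A minimal sketch of what the concatenated compression implies, assuming a simple bitmap-plus-values ZVC format (the actual in-DRAM layout is hardware-specific and not reproduced here):

```python
import numpy as np

def zvc_compress(x: np.ndarray):
    """Zero-value compression: a presence bitmap plus the packed non-zero
    values (assumed format for illustration)."""
    mask = x != 0
    return np.packbits(mask), x[mask]

def compress_node(h_layer1: np.ndarray, h_layer2: np.ndarray):
    """Multi-layer concatenation: compress the layer-1 and layer-2 hidden
    features of one node as a single ZVC block.  Because the two layers'
    sparsity patterns are roughly inverse, concatenating them raises the
    minimum number of zero values per node, which is what improves the
    efficacy of ZVC."""
    return zvc_compress(np.concatenate([h_layer1, h_layer2]))
```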
Weaknesses
-
Fundamental Evaluation Flaw: Unaccounted Maintenance Cost: The entire premise of precomputation introduces data staleness. As new nodes and edges are added during online service, the precomputed embeddings for training nodes become outdated. The authors acknowledge this and show the resulting accuracy degradation in Figure 15. Their proposed solution is a "periodic re-precomputation (refresh)." This refresh operation is computationally equivalent to performing GNN inference on the entire training graph, a massively expensive and high-latency process. This cost is completely omitted from the performance evaluation. A system that requires periodic high-latency downtime or background computation to remain accurate cannot be fairly evaluated on its low-latency inference capabilities alone. This omission is a critical flaw that invalidates the claims of providing a practical low-latency solution.
-
Misleading Baseline Comparison: The headline speedup claims (e.g., 17.9x geometric mean end-to-end) are derived from comparing the proposed EOD system against a standard GPU baseline. This is an apples-to-oranges comparison. The GPU baseline performs the full L-hop neighborhood traversal and feature gathering online (the Preparation step), while EOD offloads this expensive traversal to an offline precomputation step. EOD's online workload is fundamentally smaller and simpler. A fair comparison would be against a GPU baseline that also leverages precomputed data (the "GPUpre" case in Figure 14). As the paper itself states on page 12, the speedup of EOD over GPUpre is a much more modest 1.14-1.35×. The massive reported speedups are an artifact of an inequitable experimental setup, not a revolutionary performance gain in like-for-like inference.
Overstated Aggregation Performance: The paper prominently features aggregation speedups of over 900x (Abstract, Figure 13). While technically representing the performance on one sub-task, this is misleading. For a system paper focused on end-to-end latency, this cherry-picked metric dramatically inflates the perceived contribution. The end-to-end latency, which is the only metric that matters to the end-user, shows far more modest gains.
-
Unquantified Cumulative Accuracy Loss: The proposed system introduces at least three distinct approximations: (a) pruning of target-to-train edges (Section 4.1); (b) an "adjusted ReLU threshold" to increase sparsity (Section 4.2); and (c) the inherent staleness of precomputed embeddings between refreshes (Section 6.4). The paper analyzes these in isolation (or not at all, in the case of pruning's specific % drop) but never presents a clear analysis of their cumulative impact. The final accuracy of the EOD system at the moment just before a refresh cycle, compared to a gold-standard model with no approximations, is never reported. This leaves the true accuracy cost of the system entirely ambiguous.
Questions to Address In Rebuttal
-
What is the wall-clock time required for the "periodic re-precomputation" on the datasets evaluated (e.g., Products, Reddit)? How does this "maintenance latency" compare to the aggregated inference latency served during one refresh period? Please justify how a system requiring such a costly refresh operation can be considered a "low-latency" solution in a continuously operating online service.
-
Please justify the use of the standard GPU as the primary baseline for end-to-end speedup claims, given that it performs a fundamentally larger online workload (L-hop traversal) than EOD. Why should the results not be primarily framed in comparison to the "GPUpre" baseline, which represents a more direct, apples-to-apples comparison of inference hardware for a precomputation-based algorithm?
-
Provide a table detailing the final, end-to-end inference accuracy of the EOD system under the combined effects of all three approximations (pruning, ReLU thresholding, and data staleness at the end of a refresh cycle) compared to a non-approximated baseline model.
-
The multi-layer concatenation compression scheme appears to depend on the properties of the ReLU activation function. How does this technique perform with other common GNN models that use different activation functions (e.g., LeakyReLU, GeLU) or architectures that do not rely on simple activation sparsity (e.g., GAT)? Please comment on the generality of this core contribution.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents EOD, a hardware-software co-designed system aimed at tackling the critical challenge of low-latency inference for Graph Neural Networks (GNNs) in an inductive, real-time setting. The authors correctly identify that the primary bottlenecks are not just the GNN computation itself, but the extensive data preparation and host-to-device data transfer, exacerbated by the "neighborhood explosion" problem.
The core contribution is an elegant decoupling of the inference process. The authors propose to precompute the computationally heavy and data-intensive propagation among the existing "train" nodes offline. This transforms the online inference problem from an L-hop graph traversal to L separate 1-hop aggregations from the precomputed train node features to the new "target" nodes. To manage the storage overhead of these precomputed features, they introduce a novel "concatenated ZVC" compression scheme that exploits sparsity patterns across different GNN layers. This algorithmic approach is supported by a DIMM-based Near-Memory Processing (NMP) architecture designed specifically to perform the parallel 1-hop aggregations on the compressed data, thereby minimizing data movement and leveraging memory-side parallelism. The result is a system that dramatically reduces end-to-end latency for small-batch GNN serving.
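To illustrate the decoupling, the sketch below shows the online path under assumed GraphSAGE-style mean aggregation with equal hidden dimensions; the variable names, shapes, and the list of precomputed per-layer features H are illustrative assumptions rather than the paper's interface.

```python
import numpy as np

def online_infer(target_feats, train_nbr_ids, H, weights):
    """Illustrative online inference for one target node.

    Assumptions of this sketch: GraphSAGE-style mean aggregation, all hidden
    sizes equal d, H[l] is a (num_train, d) array of precomputed layer-l
    train-node features, and weights[l] has shape (2*d, d).  Each layer then
    needs only a 1-hop aggregation over precomputed train features instead of
    an L-hop traversal."""
    h = target_feats                                   # shape (d,)
    for l, W in enumerate(weights):
        nbr_agg = H[l][train_nbr_ids].mean(axis=0)     # served by NMP-side aggregation
        h = np.maximum(np.concatenate([h, nbr_agg]) @ W, 0.0)  # ReLU
    return h
```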
Strengths
The true strength of this paper lies in its insightful problem formulation and the holistic, co-designed solution it proposes.
-
Addressing the Right Problem: The GNN acceleration literature is crowded with work on full-batch training or transductive inference. This paper wisely targets the inductive, mini-batch inference scenario, which is far more representative of real-world, latency-sensitive applications like fraud detection and real-time recommendation. The analysis in Section 3 (page 4), which clearly shows that data preparation and memory copy dominate latency, is a crucial observation that correctly frames the entire problem and motivates the need for a system-level solution beyond a simple compute accelerator.
-
Elegant Algorithmic/Hardware Co-Design: This is not a paper that simply throws hardware at a problem. The algorithmic innovation—precomputing train-to-train propagation—is the key enabler. This single decision fundamentally changes the nature of the online workload, making it vastly more regular and amenable to a specialized NMP architecture. The subsequent compression scheme (Section 4.2, page 5) is a clever and necessary component to make the precomputation practical from a memory capacity perspective. The hardware is therefore not a generic NMP solution, but one tailored to execute the specific "concatenate aggregation" task enabled by the algorithm. This synergy is the paper's most compelling feature.
-
Connecting Disparate Research Threads: This work serves as an excellent synthesis of several important trends in computer science. It sits at the intersection of:
- GNNs: Taking models like GraphSAGE out of the lab and into production environments.
- Near-Memory Processing: Applying the principles pioneered in domains like recommendation systems (e.g., RecNMP) to the unique data access patterns of GNN inference.
- Systems for ML: Recognizing that deploying ML is a full-stack problem, where data movement, I/O, and preprocessing are often more critical than the matrix multiplications themselves.
By building these bridges, the paper provides a valuable blueprint for future research in practical ML systems.
Weaknesses
The weaknesses of the work are not in its core idea, which is sound, but in the assumptions that bound its current applicability. As a synthesizer, I see these less as flaws and more as the most fertile grounds for future work.
-
The Static "Training Graph" Assumption: The precomputation strategy is highly effective but hinges on the set of training nodes and their features being largely static. In many real-world systems (e.g., social networks or e-commerce platforms), the graph is constantly evolving with new users, products, and interactions that become part of the "known" graph. The paper's proposed solution of "periodic re-precomputation" (Section 6.4, page 12) is a practical but reactive fix. This approach avoids the crucial question of how to incrementally and efficiently update the precomputed embeddings as the base graph changes, which is a significant challenge for real-world deployment.
-
Single-Node System Abstraction: The evaluation is performed in the context of a single server with an in-memory graph database. This is a reasonable starting point, but the largest and most valuable graphs are almost always distributed across a cluster of machines. The paper acknowledges this limitation in the discussion (Section 7, page 13). In a distributed setting, the data-fetching and aggregation step would involve network latency, which could easily become the new dominant bottleneck, potentially negating some of the benefits of NMP. The paper would be strengthened by a more thorough discussion of how the EOD paradigm might extend to a distributed memory or storage environment.
-
Limited Generality of Observations: The "concatenated ZVC" compression method is motivated by the empirical observation that sparsity in hidden features can be inversely correlated across layers (Figure 5, page 6). This is an interesting finding, but its generality is not established. It may be a specific artifact of the GraphSAGE model with ReLU activations on the tested datasets. It is unclear if this property holds for other popular architectures (e.g., GATs) or different activation functions, which could limit the effectiveness of the proposed compression scheme.
Questions to Address In Rebuttal
-
Could the authors elaborate on the cost and complexity of the "refresh" or re-precomputation step? In a production environment with a constant stream of updates, how would one determine the optimal refresh frequency to balance the trade-off between inference accuracy (which decays as the graph becomes stale) and the computational cost of the refresh?
-
While a full distributed implementation is beyond scope, could the authors speculate on the architectural changes EOD would require to function in a sharded graph environment? For instance, would NMP modules need to communicate with each other (e.g., via a technology like DIMM-Link, mentioned in their citations), and how would the precomputed features be managed and accessed across the cluster?
-
Regarding the concatenated ZVC compression: Have you investigated if the observed inverse sparsity correlation holds for GNN models other than GraphSAGE, or for different non-linearities besides ReLU? How critical is a high compression ratio to the overall performance of EOD, and how gracefully does the system's performance degrade if the compression is less effective?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper, "EOD: Enabling Low Latency GNN Inference via Near-Memory Concatenate Aggregation," proposes a co-designed hardware/software system to accelerate real-time, inductive Graph Neural Network (GNN) inference. The authors identify two primary bottlenecks: the data preparation/transfer overhead and the "neighborhood explosion" problem.
The core of their proposed solution consists of three main components:
1. An algorithmic optimization based on precomputing the hidden features for all training nodes (tr-to-tr propagation), thereby reducing the online inference workload to L separate 1-hop aggregations instead of a single L-hop aggregation.
2. A compression scheme that concatenates the hidden features of a node across all GNN layers before applying Zero Value Compression (ZVC), which aims to improve the compression ratio by averaging out sparsity variations between layers.
3. A DIMM-based Near-Memory Processing (NMP) architecture specifically designed to accelerate the tr-to-tar aggregation step on data stored in this concatenated and compressed format.
While the paper demonstrates substantial performance improvements, the novelty of the core constituent ideas is limited. The primary contribution lies in the specific synthesis and co-optimization of these components into a functional and high-performance system for a very specific workload.
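To make the second component concrete, the following is my own minimal sketch of zero-value compression applied to a per-node vector formed by concatenating that node's hidden features from all layers; the bitmask-plus-values format and helper names are assumptions for illustration, not the paper's exact encoding.

```python
import numpy as np

def zvc_compress(vec):
    """Zero-value compression: a presence bitmask plus the nonzero values."""
    mask = vec != 0
    return np.packbits(mask), vec[mask]

def zvc_decompress(packed_mask, values, length):
    mask = np.unpackbits(packed_mask, count=length).astype(bool)
    out = np.zeros(length, dtype=values.dtype)
    out[mask] = values
    return out

def compress_node(layer_feats):
    """Concatenate one node's hidden features from all layers, then apply ZVC,
    so sparse and dense layers share a single bitmask/value stream."""
    return zvc_compress(np.concatenate(layer_feats))
```

The claimed benefit is that the concatenated vector's nonzero density is more uniform than any single layer's, so the compressed stream behaves more predictably than per-layer ZVC.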
Strengths
-
Novel System-Level Integration: The primary strength of this work is the coherent and tightly-coupled integration of pre-existing concepts (precomputation, ZVC, NMP) into a specialized system. While the individual parts are not new, their combination to solve the low-latency inductive GNN inference problem is a novel system-level engineering contribution.
-
Minor Algorithmic Novelty in Compression: The idea of concatenating hidden features across multiple layers before applying ZVC (Section 4.2, Pages 5-6) is a clever, albeit incremental, optimization. By combining feature vectors that may have anti-correlated sparsity patterns (as shown in Figure 5), the authors create a more favorable data distribution for compression. I have not seen this specific technique applied in prior GNN acceleration work.
-
Specialized Hardware Co-design: The NMP architecture is novel in its specialization. It is not a generic NMP unit; it is purpose-built to handle the proposed concatenated ZVC data format, including a custom instruction (Agg-instruction, Figure 10, Page 8) and an integrated ZVC decompressor (Figure 9d, Page 8). This demonstrates a deep co-design.
Weaknesses
The central weakness of this paper, from a novelty perspective, is that its foundational algorithmic and architectural pillars are built upon well-established prior art.
-
Core Algorithmic Premise is Not Novel: The main algorithmic trick—pruning tar-to-tr edges and precomputing tr-to-tr propagation to simplify online inference (Section 4.1, Page 4-5)—is not a new idea. This exact concept of decoupling training and test node computations to accelerate inductive inference has been explored before. Specifically, the work by Si et al., "Serving graph compression for graph neural networks" (ICLR 2023) [51], which the authors cite, proposes this very same decoupling strategy as its central contribution. The current paper re-implements this known technique as the basis for its hardware acceleration, but the fundamental insight is not original.
-
Architectural Paradigm is Not Novel: The use of DIMM-based NMP for accelerating aggregation-heavy workloads is a known pattern. Prior works such as RecNMP [24] and TensorDIMM [28] established the viability of placing logic on a DIMM's buffer chip to process embedding table lookups for recommendation models. More directly, GNNear [64] and GraNDe [61] have already proposed using DIMM-based NMP to accelerate GNNs. EOD follows this established architectural template. Its novelty is confined to the specific logic implemented within the NMP unit, which is tailored to its specific data format, rather than a fundamentally new NMP architecture.
-
Compression Primitive is Standard: The use of Zero Value Compression (ZVC) to exploit activation sparsity is a standard technique in ML accelerators, as seen in works like Rhu et al., "Compressing DMA engine" (HPCA 2018) [47]. The novelty in EOD lies only in the pre-processing step (concatenation), not in the compression mechanism itself.
In essence, the paper takes a known algorithm from [51], implements it on a known architectural template from [24, 61, 64], and uses a standard compression primitive [47] with a minor pre-processing twist. The impressive speedup numbers are a result of aggressively applying this known precomputation algorithm, which fundamentally changes the problem from an L-hop graph traversal to L independent table lookups—a task for which NMP is known to be highly effective.
Questions to Address In Rebuttal
-
The core precomputation strategy described in Section 4.1 appears functionally identical to the decoupling method proposed in your cited work [51]. Please clarify, with technical precision, what the novel algorithmic difference is between your method and the one in [51]. If there is no significant difference, please justify why building hardware for a known algorithm constitutes a sufficient contribution for this venue.
-
The multi-layer concatenation technique for ZVC is presented as a key enabler. Is this technique generally applicable to other GNN acceleration scenarios, or is its utility strictly limited to a context where all layers' hidden features are precomputed and stored? Its novelty is proportional to its generality.
-
Given that the primary performance gain comes from transforming the problem into one that is embarrassingly parallel (L independent 1-hop aggregations), how much of the benefit is from the NMP architecture versus the algorithm itself? Figure 14a shows that even a "GPUpre" baseline (precomputation on GPU) achieves a significant speedup (12.0-16.4x). This suggests the algorithm is the dominant factor. Please argue for the significance of the NMP architecture's novel aspects beyond simply being a good fit for a known memory-bound problem.
RAP: Reconfigurable Automata Processor
Abstract
Regular pattern matching is essential for applications such as text processing, malware detection, network security, and bioinformatics. Recent in-memory automata processors have significantly advanced the energy and memory efficiency over conventional ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present RAP, a reconfigurable in-memory automata processor designed to support three distinct automata models: NFA, NBVA, and LNFA. The central thesis is that real-world regex workloads are diverse (as shown in Figure 1), and a single-mode processor is inherently inefficient. By reusing 8T-SRAM cells for character classes (CCs) and bit vectors (BVs) and implementing specialized logic for different modes, the authors claim significant improvements in energy efficiency and compute density over state-of-the-art (SotA) specialized processors.
However, the paper's claims of superiority are predicated on a series of unquantified overheads, arbitrary compilation heuristics, and a potentially biased evaluation framework. While the motivation is sound, the presented evidence lacks the rigor required to fully substantiate the claimed benefits of reconfigurability.
Strengths
- Clear Motivation: The paper effectively motivates the need for a reconfigurable architecture. The analysis in Figure 1 (page 1) provides a compelling, data-driven rationale that different workloads benefit from different automata representations, which is the foundational premise of this work.
- Core Architectural Concept: The fundamental idea of repurposing the same in-memory fabric (8T-SRAM and local switches) to implement distinct computational models (NFA state matching, NBVA bit vector storage/manipulation, and LNFA Shift-And execution) is a technically sound and resourceful approach.
Weaknesses
-
Unquantified Overhead of Reconfigurability: The paper repeatedly claims the reconfigurable design has "little overhead" (Abstract, Section 3) but fails to provide a quantitative analysis to support this. The architecture requires additional components that are not present in a single-mode processor:
- The NBVA mode requires metadata to identify BV-STEs, an overflow checker, and control logic for the pipelined bit-vector-processing phase (Section 3.1, page 5).
- The LNFA mode repurposes pipeline buffers and requires a "specialized routing path for the active vector to facilitate bitwise shift" (Section 3.2, page 7).
These components consume static and dynamic power and silicon area, irrespective of the mode being used. The paper does not provide a breakdown of these overhead costs, making it impossible to assess the true cost-benefit of the proposed reconfigurability. The improvements shown might be entirely consumed by these overheads in workloads that do not heavily leverage all modes.
-
Arbitrary Compiler Heuristics: The decision-making process in the compiler relies on unsubstantiated heuristics. Specifically, in Section 4.2 (page 9), the compiler will use LNFA mode only if the required state expansion "does not increase the number of states by more than 2x." This "2x" threshold is arbitrary. There is no sensitivity analysis or justification provided for this specific value. A different threshold could dramatically alter the NFA/LNFA workload distribution and, consequently, the final performance results. This undermines the claim that the compiler finds the "best automata model" (Section 2, page 2).
-
Subjective Design Space Exploration (DSE): The selection of "optimal" parameters from the DSE in Section 5.3 (page 10 and Figure 10) appears subjective. The authors state, "We choose the depth that improves energy and area while offering acceptable throughput." The term "acceptable" is not defined quantitatively. For example, in Figure 10(a) for the Snort benchmark, depth=32 offers better area than the chosen depth=16. It appears parameters were selected to present the most favorable narrative rather than by applying a consistent, rigorous optimization function. This potential for post-hoc rationalization casts doubt on the optimality of the final configuration.
-
Opaque System-Level Aggregation: A critical detail regarding performance normalization is buried in Section 5.5 (page 11). The authors state that if an array's throughput in NBVA mode falls below 2 Gch/s, they "assign another RAP array to work on the same regexes to share the workload." This effectively duplicates hardware resources for certain benchmarks. It is unclear if this duplication is reflected in the final area, power, and compute density figures presented in Figure 12. If a benchmark required two arrays, its area and power should be doubled, and its compute density halved. The paper does not specify which benchmarks required this treatment, making the final summary comparisons in Figure 12 potentially misleading.
-
Potentially Biased Benchmark Comparison: The evaluation in Tables 2 and 3 compares RAP against SotA ASICs (CAMA, BVAP, CA) using a new set of benchmarks curated by the authors. While claimed to be more "up-to-date," this prevents a direct, apples-to-apples comparison against the published results for those architectures, which were often evaluated on standard suites like ANMLZoo. Without a characterization showing how these new benchmarks differ from established ones, it is difficult to determine if RAP's superior performance stems from its architecture or from a benchmark set that happens to align perfectly with its strengths.
Questions to Address In Rebuttal
- Please provide a detailed area and power breakdown of the hardware components required specifically for reconfigurability. This includes the mode-selection logic, control logic for the NBVA pipeline, the specialized LNFA active-vector routing, and any other overheads not present in a baseline single-mode NFA processor like CAMA.
- Justify the "2x" state-expansion heuristic for LNFA compilation (Section 4.2). Provide a sensitivity analysis showing how the overall system performance changes if this threshold is set to 1.5x, 2.5x, or 3x.
- Define the precise, quantitative cost function used to select the final parameters from your design space exploration (Figure 10). Specifically, how did you weigh throughput penalties against area and energy savings to define an "acceptable" trade-off?
- Clarify the system-level mapping policy mentioned in Section 5.5. Which of the evaluated benchmarks required the use of multiple RAP arrays to meet the 2 Gch/s throughput target? How are the duplicated area and power consumption for these benchmarks accounted for in the final aggregated results presented in Figure 12?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces RAP, a novel reconfigurable in-memory automata processor designed to efficiently execute regular expressions across diverse workloads. The central insight motivating this work is that no single automata model (NFA, NBVA, or LNFA) is optimal for all types of regexes commonly found in real-world applications like network security or bioinformatics. The authors compellingly demonstrate this workload diversity with an analysis of seven benchmark suites (Figure 1, page 1).
RAP addresses this challenge by creating a unified hardware fabric based on 8T-SRAM that can be dynamically configured to operate in three distinct modes, each tailored to a specific automata model. For the NBVA model, it introduces a unified storage scheme for character classes and bit vectors. For the LNFA model, it implements a hardware-accelerated Shift-And algorithm with an energy-saving binning technique. This architectural flexibility is paired with a compiler that analyzes regexes and maps them to the most efficient execution model. The evaluation shows that RAP achieves significant improvements in energy efficiency and compute density compared to prior single-mode specialized accelerators (CAMA, BVAP) and offers orders-of-magnitude gains over state-of-the-art CPU and GPU solutions.
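Since the LNFA mode is a hardware mapping of the classical Shift-And algorithm, a short software reference helps frame what the mode must implement. The sketch below is my own, handles exact characters only (a character class would simply set the same bit in several per-symbol masks), and uses names not drawn from the paper.

```python
def shift_and_match(pattern, text):
    """Classical Shift-And bit-parallel matching, the algorithm RAP's LNFA mode
    maps onto its fabric. Returns the end positions of matches in `text`."""
    m = len(pattern)
    masks = {}
    for i, c in enumerate(pattern):
        # Bit i of masks[c] is set if pattern position i accepts character c.
        # A character class would set bit i in the mask of every member char.
        masks[c] = masks.get(c, 0) | (1 << i)
    state, accept, hits = 0, 1 << (m - 1), []
    for pos, c in enumerate(text):
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            hits.append(pos)
    return hits

# e.g. shift_and_match("abab", "xababab") -> [4, 6]
```

The per-character update is a shift, an OR, and a masked AND, which is why the paper's reuse of the active vector and local routing for a bitwise shift is a plausible mapping target.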
Strengths
-
Excellent Problem Motivation and Framing: The paper's greatest strength is its clear, data-driven motivation. The analysis in Figure 1 (page 1) is the cornerstone of the work, elegantly demonstrating that regex workloads are not monolithic. This immediately establishes a strong justification for a reconfigurable approach and positions RAP not as just another accelerator, but as a practical solution to a well-characterized, real-world problem. This contextualizes the work perfectly within the existing landscape of automata processing.
-
Novel and Elegant Architectural Synthesis: The core contribution is the architectural unification of three distinct and important automata execution models. While prior work has explored specialized ASICs for NFA (CAMA), NBVA (BVAP), and software solutions for LNFA-like patterns (Hyperscan), RAP is the first to synthesize them into a single, cohesive hardware substrate. The design is clever in its reuse of core components; the repurposing of 8T-SRAM cells for both CAM-based state matching and bit-vector storage, and the reconfiguration of the local switch fabric for state transitions or bit-vector actions, is an elegant solution that minimizes overhead.
-
Strong Cross-Stack Design: The work is not just a hardware design; it includes a full compilation stack (Section 4, page 8) that bridges the gap between high-level regular expressions and the low-level hardware configuration. The inclusion of a compiler that performs rewriting, mode selection (Figure 9, page 8), and hardware mapping demonstrates a mature and comprehensive approach to the problem. This "full-stack" perspective is crucial for the usability and practical impact of any specialized accelerator.
-
Comprehensive and Rigorous Evaluation: The evaluation is extensive, comparing RAP against a wide spectrum of contemporary solutions: other specialized ASICs, high-performance CPU and GPU libraries, and FPGA designs. The use of seven diverse, real-world benchmark suites allows the authors to showcase the benefits of reconfigurability across different scenarios, reinforcing the paper's central thesis.
Weaknesses
While this is a strong paper, there are areas where its context and implications could be explored further.
-
Overhead of Reconfigurability: The paper successfully demonstrates the benefits of its approach on mixed workloads. However, it could provide a more direct analysis of the area, power, and control logic overhead incurred by the reconfigurability itself. The results for the NFA-dominant RegexLib benchmark (Figure 12, page 12) show a slight performance degradation compared to baselines, which is attributed to controller overhead. A deeper dive into this trade-off would be valuable. How much more efficient could a non-reconfigurable, "pure" NFA or NBVA processor be if built with the same underlying technology? This would help contextualize the "cost of flexibility."
-
Complexity of the Compilation and Mapping: The decision graph in Figure 9 (page 8) and the mapping heuristics for binning LNFAs (Section 3.2, page 7) suggest a complex compilation process. The paper could elaborate on the sensitivity of the final results to these heuristics. For instance, the 2x state expansion threshold for choosing the LNFA model is a key parameter. A sensitivity analysis would strengthen the claim that the compiler makes robustly optimal choices.
-
Future Architectural Directions: The work positions itself as the "first" reconfigurable automata processor. As such, it opens a new design space. The paper could benefit from a brief discussion on the future evolution of this concept. For example, the current design configures modes at deployment time. What would be the challenges and opportunities in enabling more dynamic, runtime reconfiguration in response to changing data streams? This would help connect the present work to the future research it enables.
Questions to Address In Rebuttal
-
Could the authors quantify the overhead (in terms of area, power, and control logic complexity) of supporting all three modes compared to a hypothetical, single-mode processor built with the same base technology? For instance, how much larger is an RAP tile than a tile designed only for NFA processing using the CAMA architecture?
-
The compiler's rewriting rules, particularly for LNFA (Section 4.2, page 9), involve a 2x state expansion threshold. How was this threshold determined, and how sensitive is the overall system performance to this and other compiler heuristics? Is it possible for a suboptimal choice by the compiler to negate the benefits of the underlying hardware?
-
The paper discusses configuring tiles to different modes at deployment time. Is it feasible to dynamically reconfigure tiles or arrays during runtime as the characteristics of an input data stream change? What would be the performance and energy implications (e.g., latency of reconfiguration, state flushing) of such a dynamic capability?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper presents RAP, an in-memory automata processor architecture claimed to be the first to be reconfigurable to efficiently support three distinct automata models: Nondeterministic Finite Automata (NFA), Nondeterministic Bit Vector Automata (NBVA), and Linear NFA (LNFA). The central thesis is that real-world regex workloads are diverse (as shown in Figure 1), and a single-purpose accelerator is therefore suboptimal. The proposed solution achieves reconfigurability by repurposing its core in-memory fabric—specifically, using 8T-SRAM structures as either Content-Addressable Memory (CAM) for NFA state matching or as storage for bit vectors in NBVA mode, and reconfiguring the local crossbar switches to implement either NFA transitions, NBVA bit-vector actions, or the routing for a bit-parallel Shift-And algorithm for LNFA. The core architectural novelty lies in the tight integration and unification of these processing models onto a single hardware substrate, moving beyond prior art which has focused on specialized, single-mode accelerators.
Strengths
The primary strength of this paper is its central, novel architectural concept.
-
Core Architectural Unification: The claim of being the "first reconfigurable in-memory automata processor" of this type appears to hold. Prior art has produced specialized processors. For instance, CAMA [18] focuses on an efficient NFA substrate, while BVAP [52] proposes an architecture specifically for NBVA by adding a dedicated Bit Vector Module (BVM) to an NFA processor. The key novelty of RAP is the elimination of such dedicated, single-purpose modules. Instead of adding a separate BVM, RAP integrates the bit vector storage and manipulation directly into the main processing fabric. This unification is a significant and novel architectural contribution.
-
Novel Repurposing of In-Memory Fabric: The technical execution of this unification is clever. The paper details how columns of the 8T-SRAM array are dynamically designated as either character class (CC) storage (acting as a CAM) or bit vector (BV) storage (acting as SRAM) (Section 3.1, page 5). Critically, the corresponding columns in the local switch are also repurposed. For NFA mode, they encode state transitions. For NBVA mode, they are reconfigured to encode BV actions like shift, copy, and read (Figure 5, page 5). This dual-use of the switch fabric is a specific, non-obvious, and novel design element.
-
Compiler Co-Design for Reconfigurability: The novelty is not confined to hardware. The work includes a compiler that makes strategic decisions about which automata model to use for a given regex (Figure 9, page 8). This demonstrates a full-stack approach where the hardware's novel reconfigurability is made accessible and useful. This is a crucial part of the contribution, as novel hardware without a path to programmability is of limited value.
Weaknesses
While the overarching concept is novel, a breakdown of the constituent parts reveals that some of the underlying ideas are extensions of existing work. The novelty lies almost entirely in the synthesis.
-
Incremental Nature of Component Ideas: The individual processing modes are not new. NFA execution on CAM-based in-memory architectures is the core concept of CAMA [18]. The use of bit vectors to handle bounded repetitions is the central idea of NBVA and the BVAP [52] accelerator. The Shift-And algorithm [3] for linear patterns is a classical, well-understood technique. The novelty of RAP is not in inventing these methods, but in creating a single substrate that can morph to execute all three. The contribution must be evaluated on the quality of this synthesis, not on the originality of its parts.
-
Limited Algorithmic Novelty in LNFA Implementation: The implementation of LNFA mode (Section 3.2, page 6) appears to be the least novel aspect. It is a direct hardware mapping of the Shift-And algorithm. While the repurposing of the active vector and routing paths is a necessary engineering step to fit it onto the RAP substrate, the core bit-parallel mechanism itself is not new. The "Multi-LNFA Binning" optimization (Section 3.2, page 7) is a clever mapping strategy to improve power efficiency, but it is an optimization of a known technique rather than a fundamentally new processing paradigm.
-
Ambiguity in the Novelty of "Unified Storage": The paper claims novelty in unified storage for CCs and BVs (Section 3.1, page 4). The concept of using a memory array for different purposes is, at a high level, not new. The key is in the specifics. The paper relies on prior work [24] showing 8T-SRAM can be repurposed as a CAM. The novel step is to use this known duality to dynamically partition a single array to hold both data types for a new application (automata processing). The authors should be careful to frame this as a novel application of a known circuit technique, rather than a fundamentally new circuit itself.
Questions to Address In Rebuttal
-
The distinction from BVAP [52] is central to the novelty claim. BVAP uses a dedicated, appended BVM, whereas RAP integrates this functionality into the main CAM/switch fabric. Could the authors elaborate on the novelty of the specific circuit-level techniques or architectural reconfigurations that enable this integration? Is the "BV actions Encoding Scheme" (Section 3.1, page 5) the sole enabler, or are there other non-obvious modifications required in the CAM or switch design that differentiate it from simply co-locating a smaller BVM inside a CAMA tile?
-
The hardware mapping of the Shift-And algorithm is presented as a contribution for LNFA support. Beyond the binning optimization, what is the core novelty in this mapping compared to prior specialized hardware implementations for bit-parallel matching? Is the main contribution simply demonstrating that such an algorithm can be mapped to this specific reconfigurable fabric?
-
A unified, reconfigurable design inherently involves compromises compared to a specialized one. The paper claims "little overhead" (Abstract, page 1). What specific performance, area, or energy efficiencies were sacrificed in the NFA mode (compared to a pure, non-reconfigurable CAMA-style design) and in the NBVA mode (compared to a pure BVAP-style design) to enable the reconfigurability for the third, LNFA mode? A quantification of this "reconfigurability tax" would help in assessing the significance of the novelty.
Hybrid SLC-MLC RRAM Mixed-Signal Processing-in-Memory Architecture for Transformer Acceleration via Gradient Redistribution
Abstract
Transformers, while revolutionary, face challenges due to their demanding computational cost and large data movement. To address this, we propose HyFlexPIM, a novel mixed-signal processing-in-memory (PIM) accelerator for inference that flexibly utilizes ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose HyFlexPIM, a mixed-signal Processing-in-Memory (PIM) architecture that utilizes hybrid single-level cell (SLC) and multi-level cell (MLC) RRAM. This hardware is coupled with an algorithm co-optimization technique, "gradient redistribution," which uses Singular Value Decomposition (SVD) and fine-tuning to concentrate a model's critical parameters into a small subset of weights. The stated goal is to map these critical weights to the high-precision SLC RRAM and the remaining weights to the high-density MLC RRAM, thereby improving efficiency while mitigating the accuracy loss typically associated with analog MLC PIM for Transformer models.
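To fix ideas, the mapping step can be caricatured as follows. This is my own sketch, assuming importance is scored per singular component by accumulated gradient magnitude and that MLC storage is mimicked by coarse uniform quantization; none of the function names, the quantizer, or the default fractions come from the paper.

```python
import numpy as np

def hybrid_svd_mapping(W, grad_score, rank, slc_frac=0.1, mlc_bits=2):
    """Illustrative hybrid SLC/MLC mapping after SVD truncation (names assumed).

    W          : original weight matrix.
    grad_score : importance score per singular component, e.g. accumulated
                 |gradient| observed while fine-tuning the factorized model.
    The top `slc_frac` of components stay at full precision ("SLC"); the rest
    are coarsely quantized to mimic low-precision analog MLC storage.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    U, S, Vt = U[:, :rank], S[:rank], Vt[:rank]          # SVD truncation
    n_slc = max(1, int(slc_frac * rank))
    critical = set(np.argsort(grad_score[:rank])[::-1][:n_slc])

    def quantize(x, bits):                               # crude uniform quantizer
        scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
        return x if scale == 0 else np.round(x / scale) * scale

    W_hat = np.zeros_like(W)
    for i in range(rank):
        comp = S[i] * np.outer(U[:, i], Vt[i])           # rank-1 contribution
        W_hat += comp if i in critical else quantize(comp, mlc_bits)
    return W_hat
```

The review's concern, restated against this sketch, is whether fine-tuning reliably concentrates grad_score into a small prefix for all layers and models, since the whole mapping hinges on that assumption.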
Strengths
- The work correctly identifies a fundamental challenge in the field: the severe accuracy degradation of complex models like Transformers when implemented on analog MLC PIM architectures due to non-idealities.
- The proposed solution attempts to address this challenge through a hardware-software co-design approach, which is more comprehensive than a purely hardware-centric or algorithm-centric solution.
- The evaluation is conducted across a reasonable range of model types (encoder, decoder, vision) and datasets, demonstrating an attempt at comprehensive validation.
Weaknesses
The paper’s central claims rest on a foundation that appears insufficiently supported by the provided evidence. The methodology contains several points that require significant clarification and justification.
-
The Central Premise of "Gradient Redistribution" is Anecdotal: The core of the software contribution is the claim that SVD and fine-tuning can "concentrate significance into a small subset of weights" (Abstract). The primary evidence for this is Figure 11, which shows a change in gradient distribution. However, this figure is presented for a single "FC layer in BERT-Base." This is insufficient evidence to establish a general principle. It is not demonstrated that this phenomenon holds for all layers (e.g., attention vs. FFN), or across all evaluated models (GPT-2, Llama3, ViT). Without broader evidence, the entire co-design strategy is based on a single, potentially anomalous, data point.
-
Claims of "Negligible" Accuracy Impact are Overstated and Misleading for Decoder Models: The abstract claims "negligible accuracy repercussions (less than 1%)". While this appears to hold for some encoder tasks (Figure 12a), the evaluation for decoder models (Figure 12b) uses "Loss" as the metric, not accuracy or perplexity. The text states that with a 20% SLC rate, there is a "less than a 10% increase in loss" (Section 6.1, page 10). A 10% increase in loss is not "negligible" and can lead to a significant degradation in the quality of generated text. The paper fails to provide any qualitative evaluation (e.g., sample outputs) or standard perplexity metrics to demonstrate that this loss increase is acceptable in practice.
-
The Comparison to Baselines is Potentially Unfair: To compare against ASADI [31], which uses FP32, the authors create a modified baseline,
ASADI†, which uses INT8 "for fair comparison" (Section 5.3, page 9 and Section 6.3.1, page 11). This modification is problematic. The architecture of ASADI may be fundamentally optimized for FP32 operations; simply quantizing its inputs to INT8 without re-evaluating its entire design could artificially cripple its performance and efficiency. A truly fair comparison would require a more rigorous justification of this modification or comparison against a state-of-the-art INT8-native accelerator.
-
Hardware Overhead Claims Appear Minimized: The paper claims the reconfigurable SLC/MLC mechanism has "minimal overhead (<1% for area & energy)" (Abstract, page 1). The core of this mechanism is a flexible SAR-ADC that can switch between 6-bit and 7-bit operation (Section 3.2, page 6). The authors correctly state that increasing ADC precision by one bit roughly doubles energy consumption. Given that the ADC constitutes 64.2% of the area and 55% of the power of the analog PIM module (Table 2), a 7-bit ADC represents a substantial overhead compared to a dedicated 6-bit ADC. The "<1%" figure seems to be calculated relative to the entire multi-PU chip, which is a misleading denominator that trivializes a significant cost at the module level.
-
Non-Ideality Modeling is Oversimplified: The accuracy analysis relies on a simulator that models RRAM noise using a single Gaussian distribution based on a BER from prior work (Section 5.2, page 9). This model omits several other critical non-idealities inherent to real-world analog PIM crossbars, such as IR drop across long wordlines/bitlines, non-linearity of peripheral circuits (e.g., DACs, S&H), and inter-cell crosstalk. Without accounting for these effects, the reported accuracy results may be overly optimistic.
Questions to Address In Rebuttal
- Please provide evidence that the "gradient redistribution" effect shown in Figure 11 is a general phenomenon. Specifically, show gradient distributions for other layers within BERT-Base, and for representative layers from the GPT-2, Llama3, and ViT models.
- For the decoder model evaluations, please justify why a loss increase of up to 10% should be considered acceptable. Provide standard perplexity scores and/or qualitative examples of generated text from the baseline model and from your proposed architecture (e.g., with 20% SLC rate) to demonstrate that the output quality is not meaningfully degraded.
- Please provide a robust defense for the fairness of the ASADI† baseline. Did you consider how the ASADI architecture might be re-optimized if designed for INT8 from the ground up? How does your modification account for such potential optimizations?
- Please clarify the hardware overhead calculation. What is the area and energy overhead of using a 7-bit capable ADC compared to a 6-bit ADC, expressed as a percentage of the analog PIM module's total area and energy, not the entire chip?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents HyFlexPIM, a hardware-software co-designed solution for accelerating Transformer inference using a mixed-signal Processing-in-Memory (PIM) architecture based on Resistive RAM (RRAM). The core contribution is not merely the hybrid hardware, which flexibly utilizes both high-precision but costly single-level cell (SLC) RRAM and high-density but noisy multi-level cell (MLC) RRAM. Instead, the central and most compelling idea is an algorithmic technique termed "gradient redistribution."
This technique proactively reshapes the Transformer model's weight structure to make it explicitly amenable to the hybrid hardware. By employing Singular Value Decomposition (SVD) and a targeted fine-tuning process, the authors demonstrate the ability to concentrate a model's "importance"—as measured by gradient magnitudes—into a small fraction of its parameters (5-10%). This allows the vast majority of the model to be mapped to the highly efficient MLC portion of the hardware, while the small, critical subset is protected in the accurate SLC portion, thus striking a novel and effective balance between efficiency and accuracy. The paper evaluates this co-design on a range of Transformer models, showing significant throughput and energy efficiency gains over state-of-the-art PIM and non-PIM baselines.
Strengths
-
Novelty and Elegance of the Core Idea: The primary strength of this work lies in its foundational insight: instead of passively relying on a model's inherent error resilience, we can actively reshape the model to amplify its resilience and make it conform to the characteristics of the underlying hardware. The "gradient redistribution" technique (Section 4, page 7) is an elegant repurposing of well-known methods like SVD and fine-tuning. It shifts the objective from pure compression to hardware-aware importance concentration. This represents a significant conceptual advance in the field of hardware-algorithm co-design.
-
Addressing a Critical Bottleneck in Analog PIM: The Achilles' heel of analog PIM, especially for complex models like Transformers, has always been its susceptibility to noise and non-idealities, leading to unacceptable accuracy degradation. Many researchers have either retreated to digital PIM or limited analog PIM to less critical tasks. This paper confronts the problem head-on by creating a clear, algorithmically-defined demarcation between what must be precise and what can be approximated. This approach could provide a viable path forward for deploying high-efficiency analog compute for the most demanding AI workloads.
-
Holistic and Well-Integrated System Design: The work demonstrates a commendable end-to-end perspective. It connects a high-level algorithmic concept (gradient redistribution) to a detailed architectural design (HyFlexPIM in Section 3, page 4) and even considers circuit-level realities (the reconfigurable 6/7-bit ADC in Section 3.2, page 6). This integration makes the proposed solution feel grounded and plausible, rather than a purely theoretical exercise. The justification for using 2-bit MLC based on realistic device error rates is particularly well-argued.
-
Comprehensive Experimental Validation: The evaluation is thorough, spanning encoder-only (BERT), decoder-only (GPT-2, Llama3), and vision (ViT) models across multiple datasets (Figure 12, page 10). The comparison against relevant and strong baselines like ASADI and SPRINT provides a clear context for the claimed improvements and strengthens the paper's conclusions.
Weaknesses
While the core idea is strong, the work's presentation and scope leave a few areas for deeper consideration. These are not fatal flaws but rather opportunities for strengthening the work.
-
Scalability of the Pre-processing Step: The gradient redistribution algorithm relies on SVD and 1-3 epochs of fine-tuning. While this is a one-time, offline cost, the computational complexity of SVD on the very large weight matrices found in foundation models (e.g., 100B+ parameters) is non-trivial. The paper could benefit from a discussion on the practical feasibility and cost of applying this pre-processing step to state-of-the-art, large-scale models.
-
Discrepancy in Generative Model Performance: The results for decoder models like GPT-2 and Llama3, while still showing a benefit, are less impressive than for encoder models. The authors note a "less than 10% of loss increase" (Section 6.1, page 10), which is a more significant degradation than the "<1% accuracy drop" seen in BERT. The paper does not offer a deep analysis of why this discrepancy exists. Is it due to the autoregressive nature of decoding, error accumulation, or some other fundamental difference in the models' sensitivity? A more profound exploration of this point would be valuable.
-
Generality of the Architecture: The HyFlexPIM architecture is presented as a co-design, and its symbiosis with the gradient redistribution algorithm is clear. However, it's less clear how the architecture would perform on models or tasks that have not undergone this specific pre-processing. A brief discussion on the architecture's general-purpose utility would help contextualize its flexibility beyond this specific use case.
Questions to Address In Rebuttal
-
Could the authors elaborate on the computational cost of the gradient redistribution pre-processing (SVD and fine-tuning) and comment on its viability for models at the scale of today's largest foundation models?
-
Could the authors provide more insight into why decoder-based generative models appear more sensitive to the hybrid SLC/MLC mapping than encoder models, resulting in a more noticeable increase in loss? Does this suggest a fundamental limitation or an area for future algorithmic refinement?
-
The core principle of "gradient redistribution" seems broadly applicable. Can the authors comment on whether this technique could be valuable for other heterogeneous computing platforms beyond RRAM PIM—for instance, mapping workloads to a mix of high-precision and low-precision digital ALUs or different memory tiers?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper proposes "HyFlexPIM," a mixed-signal Processing-in-Memory (PIM) architecture for accelerating Transformer inference. The architecture is described as a hybrid system, utilizing digital PIM for dynamic, high-precision operations (e.g., attention) and analog PIM for static, linear operations (e.g., weights in FC layers). The core of the analog PIM is its further hybridization, using high-precision Single-Level Cell (SLC) RRAM for critical weights and high-density Multi-Level Cell (MLC) RRAM for less critical weights.
The authors' primary claim to novelty lies not in the architecture itself, but in an algorithm-hardware co-optimization technique they term "gradient redistribution." This technique employs Singular Value Decomposition (SVD) and subsequent fine-tuning to intentionally concentrate the model's "importance"—as measured by gradient magnitudes—into a small fraction of singular values. This reshaping of the model's internal structure makes it highly amenable to the proposed hybrid SLC/MLC hardware mapping, enabling significant efficiency gains with minimal accuracy loss.
Strengths
The primary strength and sole novel contribution of this work is the concept of "gradient redistribution." While the individual components of this technique are well-established, their synthesis for this specific purpose appears to be new.
-
Novel Synthesis of Existing Techniques: The authors combine SVD, model truncation, and fine-tuning. None of these are new. However, the explicit goal of using the fine-tuning process not merely to recover accuracy but to actively reshape the gradient landscape (as shown in Figure 11c, page 8) for a specific hardware target (hybrid SLC/MLC PIM) is a clever and novel insight. The work shifts from passively mapping a model to actively engineering the model to fit the hardware.
-
A Clear Enabling Technology: The paper correctly identifies a key problem: a naive mapping of a Transformer model onto a hybrid SLC/MLC architecture would yield minimal benefits, as the important weights are not naturally concentrated in a small, easily identifiable subset. The proposed algorithm directly solves this problem, acting as an essential "enabler" that makes the hybrid hardware architecture viable and effective. This linkage between the algorithm and the hardware's potential is the paper's main contribution.
Weaknesses
The paper's novelty is narrowly focused on the co-design algorithm. The architectural aspects, which are heavily emphasized in the title and framing, are largely derivative of prior work.
-
Lack of Architectural Novelty: The core architectural concepts are not new.
- Mixed-Signal PIM: The idea of combining digital PIM for control-heavy/high-precision logic and analog PIM for dense multiply-accumulate operations is an established pattern in the PIM field. Prior works like ASADI [31] and ReHy [24, 25] have already proposed such hybrid digital/analog systems.
- Hybrid SLC/MLC Memory: The concept of using SLC for critical/high-endurance data and MLC for dense/less-critical data is a foundational principle in memory systems, particularly in NAND flash storage. Applying this to RRAM for PIM is a logical, but incremental, extension. Indeed, reference [20] in the authors' own bibliography describes a hybrid SLC-MLC ReRAM compute-in-memory macro.
-
Potentially Misleading Framing: The paper is titled and presented as a new architecture. However, the architectural design is a vehicle for the novel algorithm. The work would be more accurately framed as a new co-design algorithm that unlocks the potential of a known class of hybrid PIM architectures. The current framing overstates the novelty of the hardware contribution.
-
Limited Exploration of the Core Mechanism: The paper states that fine-tuning concentrates the gradients, but the mechanism is not deeply explored. Is this a natural consequence of fine-tuning a truncated model, or did the authors employ a specific regularizer or loss term to encourage this concentration? The novelty would be significantly stronger if the fine-tuning process itself was modified to explicitly drive gradient redistribution, rather than just observing it as a side effect.
Questions to Address In Rebuttal
-
Clarification of Architectural Delta: The concept of a hybrid digital-analog PIM and a hybrid SLC-MLC memory subsystem are both present in prior art. Can the authors precisely articulate what specific architectural element of HyFlexPIM (e.g., the reconfigurable ADC, the interconnect) is fundamentally novel, separate from the "gradient redistribution" algorithm?
-
Mechanism of Gradient Redistribution: As per Algorithm 1 (page 7), the fine-tuning step (Step 3) appears to be a standard training process. Is the observed gradient concentration an emergent property of standard fine-tuning on a post-SVD truncated matrix, or is there a non-standard aspect of the optimizer or loss function used to explicitly encourage this redistribution? If it is the former, how confident are the authors that this effect will generalize to other model architectures and compression techniques beyond SVD?
-
Interaction with Other Optimizations: The proposed method creates a model structure optimized for SLC/MLC mapping. How does this interact with other common optimization techniques like quantization or structured pruning? For instance, does the process of redistributing importance via gradients conflict with identifying weights that are amenable to quantization or entire filters that can be pruned? The novelty of the technique hinges on its utility, which may be diminished if it is incompatible with other state-of-the-art methods.
REIS: A High-Performance and Energy-Efficient Retrieval System with In-Storage Processing
Abstract
Large Language Models (LLMs) face an inherent challenge: their knowledge is confined to the data that they have been trained on. This limitation, combined with the significant cost of retraining renders them incapable of providing up-to-date responses. To ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper proposes REIS, an In-Storage Processing (ISP) system designed to accelerate the retrieval stage of Retrieval-Augmented Generation (RAG) pipelines. The authors identify the I/O transfer of the vector database from storage to host memory as the primary performance bottleneck. To address this, REIS implements three key mechanisms: (1) a database layout that separates embeddings from document chunks and links them via the NAND Out-Of-Band (OOB) area; (2) an ISP-friendly implementation of the Inverted File (IVF) ANNS algorithm; and (3) an ANNS engine that repurposes existing NAND peripheral logic (e.g., page buffers, XOR logic, fail-bit counters) to perform Hamming distance calculations directly within the flash dies. The authors claim that this approach significantly improves performance and energy efficiency over conventional CPU-based systems and prior ISP accelerators, crucially, without requiring any hardware modifications.
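The in-die computation amounts to XOR followed by a bit count over packed binary embeddings. The reference below is my own host-side analogue, assuming embeddings are packed into uint8 arrays; the helper name and the random example data are illustrative only.

```python
import numpy as np

def hamming_topk(query_bits, db_bits, k):
    """Reference for the XOR + popcount search REIS maps onto NAND peripherals.

    query_bits : uint8 array, one binarized and packed query embedding.
    db_bits    : 2-D uint8 array, one packed binary embedding per row.
    Returns indices of the k embeddings with the smallest Hamming distance.
    """
    xor = np.bitwise_xor(db_bits, query_bits)            # in-plane XOR analogue
    dists = np.unpackbits(xor, axis=1).sum(axis=1)       # fail-bit-counter analogue
    return np.argsort(dists)[:k]

# Example with 128-bit embeddings packed into 16 bytes each (synthetic data):
rng = np.random.default_rng(0)
db = rng.integers(0, 256, size=(1000, 16), dtype=np.uint8)
q = rng.integers(0, 256, size=16, dtype=np.uint8)
print(hamming_topk(q, db, k=5))
```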
Strengths
- Problem Motivation: The paper correctly identifies and quantifies a critical performance bottleneck in large-scale RAG systems: the I/O cost of loading the vector database from storage (Section 3.1, Figure 2). The motivation is clear and empirically grounded.
- Resourceful Mechanism Design: The core idea of leveraging existing, but typically inaccessible, peripheral logic within NAND flash dies for computation (Section 4.3, Figure 6) is a creative and efficient use of resources. Using in-plane XOR and bit-counters for Hamming distance is a clever way to avoid adding dedicated MAC units.
- Comprehensive Baselines: The evaluation compares REIS against a high-end CPU system, an idealized "No-I/O" case, and two state-of-the-art ISP-based accelerators (NDSearch and ICE). This provides a robust context for assessing the claimed performance improvements.
Weaknesses
The paper’s central claims rest on a series of optimistic assumptions and questionable design choices that undermine its practical viability and rigor.
-
The "No Hardware Modification" Claim is Disingenuous: The authors repeatedly emphasize that REIS requires no hardware modifications, but this claim is misleading.
- Hybrid SLC/TLC Partitioning: The system relies on programming a portion of the TLC SSD as SLC to achieve the reliability needed for ECC-less operation (Section 4.1.2). This is a major architectural decision with a direct cost: it sacrifices two-thirds of the capacity in that partition. This massive reduction in storage density is a practical modification with significant cost implications ($/GB) that are not acknowledged.
- Firmware and Logic Overhaul: The proposal requires new, custom NAND flash commands (Table 2) and a significantly modified SSD controller to manage the complex, multi-step IVF search process (Section 4.4.2). This includes orchestrating data movement between latches, triggering in-plane computations, and managing custom data structures like the TTL in DRAM. This is a substantial firmware and controller logic modification, not a simple software overlay.
-
Unrealistic System-Level Requirements: The proposed database layout imposes impractical constraints on the storage system.
- Physical Contiguity: The coarse-grained access scheme requires that the database regions for embeddings and documents be physically contiguous (Section 4.1.4 and Section 8). On a multi-terabyte SSD that undergoes wear-leveling, garbage collection, and bad block management, ensuring and maintaining large contiguous physical blocks is operationally infeasible. The paper dismisses this as a one-time "upfront overhead," but this fundamentally conflicts with how modern FTLs manage flash media over their lifetime.
- FTL Simplification: The paper proposes a lightweight "R-DB" mapping structure to reduce DRAM footprint, effectively bypassing a conventional page-level FTL for the RAG database. However, it fails to adequately address how critical flash management tasks (e.g., bad block retirement, read disturb mitigation, wear-leveling) would be handled within these large, statically mapped regions. Stating that maintenance operations are "rare" (Section 4.1.4) is insufficient for a system intended for reliable deployment.
-
Insufficient Algorithmic Justification and Analysis:
- Dismissal of HNSW: The authors discard graph-based algorithms like HNSW based on the high-level argument of "irregular data access patterns" (Section 4.2). This justification is superficial. They provide no simulation or detailed analysis of how HNSW would actually perform on their ISP architecture. The CPU-based comparison in Figure 5 is irrelevant for proving its unsuitability for an ISP design. It is plausible that caching the upper layers of an HNSW graph in the SSD's DRAM could yield competitive performance.
- Distance Filtering Sensitivity: The performance of REIS leans heavily on the efficacy of Distance Filtering (DF), which is shown to provide the largest single optimization boost (Section 6.3, Figure 9). However, the selection of the filtering threshold appears heuristic. The claim that it "only weakly depends on the dataset size" (Section 4.3.3) is a strong one, yet it is supported by tests on only four datasets. This critical parameter’s robustness across different data modalities, embedding models, and query distributions is not proven.
-
Reliability Concerns Are Understated: The decision to disable ECC for the binary embeddings is a critical point of failure. The paper’s entire justification rests on Enhanced SLC Programming (ESP) achieving zero Bit Error Rate (BER). While ESP improves margins, claiming 0 BER over a drive's full lifetime, across variations in temperature and P/E cycles, is highly optimistic and lacks sufficient supporting data for the specific context of performing in-die computations. A single uncorrected bit-flip in an embedding could silently corrupt a search result.
Questions to Address In Rebuttal
- Please provide a detailed cost-benefit analysis to justify the "no hardware modification" claim. Specifically, quantify the effective cost increase ($/GB) from sacrificing TLC capacity for the SLC partition. Furthermore, characterize the engineering complexity of implementing the new NAND commands and the bespoke IVF control logic in the SSD controller firmware.
- Address the physical contiguity requirement in a real-world scenario. How would REIS handle database deployment on a partially filled, fragmented drive? What is the performance overhead of the defragmentation process required to create these contiguous regions, and how does this affect the total time-to-solution?
- Provide a more rigorous, quantitative argument for choosing IVF over HNSW specifically for an ISP architecture. A simple reference to access patterns is insufficient. This requires at least a detailed simulation of HNSW’s data access patterns at the NAND channel/die/plane level to demonstrate that it would indeed lead to under-utilization.
- Substantiate the claim that ESP can guarantee 0 BER for the SLC partition over the entire operational lifetime (e.g., 5 years, specified DWPD) of an enterprise SSD. Please provide either experimental data under accelerated aging or citations to literature that validates this specific claim in the context of forgoing ECC.
- How is the Distance Filtering threshold determined for a new, arbitrary dataset? Please provide evidence of its robustness and performance impact on datasets with fundamentally different distributions from the BEIR benchmarks used in the paper (e.g., non-textual embeddings).
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents REIS, a complete, in-storage processing (ISP) system designed to accelerate the retrieval stage of Retrieval-Augmented Generation (RAG) pipelines. The authors correctly identify that for large knowledge bases, the I/O transfer of embeddings and documents from storage to host memory constitutes a primary performance and energy bottleneck.
The core contribution is a holistic, hardware/software co-designed system that pairs a storage-friendly Approximate Nearest Neighbor Search (ANNS) algorithm (Inverted File, or IVF) with a clever method of execution that leverages existing, unmodified computational logic within the NAND flash dies of a modern SSD. This is complemented by a tailored database layout that efficiently links embeddings to their corresponding documents and manages data placement to maximize internal parallelism. By moving the entire retrieval process (both search and document fetching) into the storage device, REIS demonstrates an order-of-magnitude improvement in performance (avg. 13x) and energy efficiency (avg. 55x) for the retrieval stage over a high-end server baseline.
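For concreteness, the following is a minimal, reviewer-written sketch of the retrieval kernel this summary describes: IVF cluster selection over binary embeddings followed by Hamming-distance scoring via XOR and popcount, the two primitives REIS maps onto NAND peripheral logic. It is a host-side Python analogue, not the authors' firmware; the code width, cluster count, and top-k values are arbitrary placeholders.

```python
import numpy as np

def hamming(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # XOR then popcount -- the two primitives REIS maps onto NAND latches and
    # the fail-bit counter (here emulated with numpy on byte-packed vectors).
    return np.unpackbits(np.bitwise_xor(a, b), axis=-1).sum(axis=-1)

def ivf_search(query: np.ndarray, centroids: np.ndarray,
               clusters: list[np.ndarray], nprobe: int = 2, k: int = 5):
    # 1) Coarse step: pick the nprobe closest centroids (binary, Hamming).
    order = np.argsort(hamming(centroids, query))[:nprobe]
    # 2) Fine step: stream the selected clusters and score every embedding.
    best: list[tuple[int, int, int]] = []  # (distance, cluster id, row id)
    for c in order:
        d = hamming(clusters[c], query)
        best.extend((int(d[i]), int(c), i) for i in range(len(d)))
    return sorted(best)[:k]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim_bytes = 128  # 1024-bit binary embeddings (assumed width)
    centroids = rng.integers(0, 256, (16, dim_bytes), dtype=np.uint8)
    clusters = [rng.integers(0, 256, (1000, dim_bytes), dtype=np.uint8)
                for _ in range(16)]
    q = rng.integers(0, 256, dim_bytes, dtype=np.uint8)
    print(ivf_search(q, centroids, clusters))
```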
Strengths
The true strength of this paper lies in its synthesis of ideas from disparate domains—information retrieval, computer architecture, and storage systems—into a single, cohesive, and remarkably practical solution.
-
Problem Significance and Framing: The paper tackles an exceptionally timely and important problem. As LLMs become ubiquitous, RAG is emerging as the dominant paradigm for grounding them with factual, up-to-date information. The authors provide clear empirical evidence in Section 3.1 (page 5, Figure 2) that I/O, not computation, is the scaling limiter for RAG, effectively motivating the need for a systems-level solution.
-
Holistic, End-to-End System Design: This is not merely a paper about accelerating ANNS. It addresses the entire retrieval problem. The authors have thought through the full data path: from the choice of an ISP-friendly algorithm (IVF over HNSW, justified in Section 4.2), to the physical data layout (Section 4.1), the low-overhead linkage of embeddings to documents using the NAND OOB area (Section 4.1.3), and the final return of only the relevant document chunks to the host. This complete vision is rare and highly valuable.
-
Pragmatism and High Potential for Adoption: The most significant contribution, from a practical standpoint, is the commitment to using existing hardware. The core computational kernel of the REIS ANNS engine (Section 4.3) is built upon repurposing standard peripheral logic in NAND flash dies—using XOR gates for distance calculation (for binary quantized vectors) and fail-bit counters for population counting. This "zero-cost" hardware approach dramatically lowers the barrier to adoption compared to prior academic proposals that require bespoke accelerators or significant modifications to the SSD controller. It transforms the problem from one of hardware design to one of firmware and system software, which is a much more tractable path to real-world impact.
-
Connecting Architectural Principles to Application Needs: The work serves as a powerful case study in the value of near-data processing. It demonstrates a deep understanding of the internal architecture of modern SSDs—leveraging channel, die, and plane-level parallelism—and directly maps these architectural features to the needs of a cutting-edge AI workload.
Weaknesses
The weaknesses of the paper are less about flaws in the core idea and more about the boundaries and future implications of the proposed system, which could be explored more deeply.
-
Implicit Assumption of a Static Knowledge Base: The proposed database layout, particularly the coarse-grained access scheme (Section 4.1.4) and the reliance on physical data contiguity, is highly optimized for a read-only or infrequently updated dataset. The paper would be strengthened by a more thorough discussion of the challenges of handling dynamic RAG databases where new documents are frequently added, updated, or deleted. The proposed defragmentation would be a significant overhead in such scenarios.
-
Specialization vs. Generality: The REIS engine is exquisitely tuned for the IVF algorithm on binary quantized embeddings. This tight co-design is its strength, but also a potential weakness. What is the path forward if retrieval techniques evolve? For instance, if future research demonstrates a clear superiority of graph-based methods even in an ISP context, or if higher-precision vectors are required, it's unclear how the REIS framework would adapt without the very hardware modifications it so successfully avoids.
-
Interaction with Standard SSD Management: The paper mentions that REIS operates in an exclusive "RAG-mode" to simplify the design and avoid interference with normal FTL operations like garbage collection (Section 7.2). While a pragmatic choice, this raises questions about the cost of context switching between modes and the performance implications for mixed-workload environments where the SSD must serve both RAG queries and traditional I/O requests.
Questions to Address In Rebuttal
-
Could the authors elaborate on the cost and complexity of updating a REIS-managed database? What would be the performance impact of frequent appends or updates, and can the reliance on physical contiguity be relaxed without sacrificing too much performance? An alternative linkage mechanism is briefly mentioned in Section 8, but its trade-offs are not fully explored.
-
The "exploit latent computation" approach is the paper's most brilliant aspect. Beyond the specific mapping of Hamming distance to XOR and popcount logic, have the authors considered what other computational primitives might be hiding in plain sight within SSDs? Does this work suggest a new direction for SSD design, where manufacturers might formally expose a small set of simple, parallel primitives (e.g., bitwise AND/OR, simple comparisons) for general-purpose, in-situ computation?
-
Regarding the "RAG-mode" vs. "normal-mode" operation, what is the anticipated latency for loading the respective FTL metadata and switching between these modes? In a multi-tenant cloud environment, how would the system arbitrate between a high-priority RAG query and an incoming write operation that might trigger garbage collection?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper introduces REIS, a retrieval system designed to accelerate the retrieval stage of Retrieval-Augmented Generation (RAG) pipelines via In-Storage Processing (ISP). The authors' central claim to novelty rests on a cohesive framework of three core mechanisms: (1) an ISP-tailored implementation of the cluster-based Inverted File (IVF) ANNS algorithm, chosen for its regular access patterns which are amenable to NAND flash architecture; (2) a hardware-less ANNS computation engine that repurposes existing SSD peripheral logic (latches, fail-bit counters) to perform binary distance calculations without adding new hardware; and (3) a novel database layout that links embeddings to documents using the NAND flash Out-Of-Band (OOB) area and employs a lightweight FTL to reduce host-side overhead. The work positions itself as the first complete ISP-based RAG retrieval system that avoids the pitfalls of prior ISP-ANNS accelerators, namely the use of ISP-unfriendly algorithms and the introduction of significant hardware modifications.
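As a reading aid, the sketch below illustrates the kind of record layout the OOB-based linkage implies: each embedding's flash page carries, in its out-of-band bytes, the physical coordinates of the document chunk it belongs to, so a hit can be resolved to text without a host-side index. The field names and widths are this reviewer's assumptions, not the paper's format.

```python
from dataclasses import dataclass

@dataclass
class OOBLink:
    # Hypothetical out-of-band metadata attached to an embedding's flash page:
    # the physical coordinates of the associated document chunk.
    channel: int
    die: int
    block: int
    page: int
    length_bytes: int

@dataclass
class EmbeddingRecord:
    vector: bytes   # binary-quantized embedding stored in the SLC region
    link: OOBLink   # pointer into the TLC document region

def fetch_documents(hits: list[EmbeddingRecord], read_page) -> list[bytes]:
    """Resolve search hits to document chunks without leaving the device.

    `read_page` stands in for an SSD-internal page-read primitive.
    """
    return [read_page(h.link)[: h.link.length_bytes] for h in hits]
```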
Strengths
The primary strength of this work lies in its specific, technically deep novel contributions that are elegantly tailored to the constraints of existing hardware.
-
Novel Repurposing of Existing Hardware for ANNS: The most significant novel idea is the in-storage ANNS engine detailed in Section 4.3 (Page 9). While prior work has proposed ISP for ANNS, those works typically involve adding dedicated accelerators (e.g., DeepStore [192]) or have different computational models. REIS’s proposal to perform XOR and population count (for Hamming distance on binary embeddings) by repurposing the existing Sensing Latch (SL), Cache Latch (CL), Data Latch (DL), and the fail-bit counter is a genuinely clever and non-intrusive approach. This "zero-hardware-modification" principle is a powerful and novel contribution in the domain of computational storage.
-
Novel Algorithm-Hardware Co-Design: The explicit choice of the IVF algorithm over the more commonly accelerated graph-based algorithms like HNSW (used in NDSearch [299]) is a key element of novelty. The authors correctly identify that HNSW’s irregular, pointer-chasing access patterns are a poor fit for the highly parallel but block-oriented access of a modern SSD. Their justification in Section 4.2 (Page 8) for selecting IVF due to its contiguous, streaming-friendly access patterns represents a novel insight in the co-design of ANNS algorithms and storage-level processing.
-
Novel System-Level Integration for RAG: The paper proposes a complete system, not just an ANNS accelerator. The mechanism for linking embeddings to their corresponding documents directly within the storage device using the OOB area (Section 4.1.3, Page 7) is a novel solution to the often-overlooked document retrieval part of the RAG pipeline. This integration ensures that the benefits of in-storage search are not lost to a subsequent, slow document fetch operation.
Weaknesses
My critique is focused on contextualizing the novelty and questioning the practical generality of some of the proposed mechanisms. The core ideas are strong, but their foundations rest on enabling technologies that are not themselves novel.
-
Foundational Mechanisms Are Not Novel: The core mechanism of performing bulk bitwise operations inside a NAND flash die is not new. The authors' own citation, Flash-Cosmos [224], is the foundational work that demonstrated this capability. REIS appears to be a novel and compelling application of this pre-existing technique to the ANNS problem (specifically, using XOR for Hamming distance). The authors should be more precise in claiming their contribution is the specific application and system integration, not the invention of in-flash bitwise computation.
-
Component Ideas Lack Originality: Similarly, the concepts of Hybrid SSDs (using SLC for performance-critical data and TLC for capacity) and leveraging the OOB area for metadata are established concepts in SSD design. While their application here—SLC for embeddings, TLC for documents, and OOB for document pointers—is novel in the context of RAG, the underlying architectural ideas are prior art.
-
Contiguity Requirement is a Significant Caveat: The proposed coarse-grained access scheme (Section 4.1.4, Page 7) relies on storing database regions in physically contiguous blocks to enable a lightweight FTL. This is a very strong assumption that is difficult to guarantee in a real-world, dynamic storage system that suffers from fragmentation. This practical limitation may reduce the novelty of the FTL optimization, as it is only applicable under idealized conditions that are not typical of general-purpose storage.
Questions to Address In Rebuttal
-
Please clarify the precise delta between the in-plane computation proposed in REIS and the foundational work in Flash-Cosmos [224]. Is the contribution the specific sequence of operations (Input Broadcasting, XOR, popcount) to calculate Hamming distance for ANNS, rather than the underlying mechanism of in-flash computation itself?
-
The selection of IVF is justified by its streaming-friendly access pattern. Is this contribution the specific choice of IVF, or the more general principle of selecting ANNS algorithms with high data locality for ISP? Have other streaming-friendly algorithms (e.g., certain forms of LSH) been considered, and if so, why was IVF superior?
-
The reliance on physical data contiguity (Section 4.1.4, Page 7 and Section 8, Page 14) is a major practical concern. How does the system handle a database update that leads to fragmentation? Does the need to perform potentially costly defragmentation operations negate the performance benefits of the lightweight FTL, thereby limiting the novelty of this specific optimization to write-once, read-many scenarios?
MicroScopiQ: Accelerating Foundational Models through Outlier-Aware Microscaling Quantization
Abstract
Quantization of foundational models (FMs) is significantly more challenging than traditional DNNs due to the emergence of large magnitude values called outliers. Existing outlier-aware algorithm-architecture co-design techniques either use mixed-precision,...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose MicroScopiQ, a co-design technique for quantizing foundational models that combines structured pruning with outlier-aware quantization. The central idea is to quantize outliers to a higher precision (using the MX floating-point format) and inliers to a lower precision (MX integer format). To maintain memory alignment and a consistent bit-budget, the additional bits required for the high-precision outliers are stored in the locations of the least important inlier weights, which are pruned using Hessian information. The authors present a hardware accelerator architecture featuring a "Redistribution and Coordination NoC" (ReCoN) to manage the reordering of these distributed outlier bits during computation.
While the proposed method shows promising accuracy results, particularly at ultra-low bit-widths, the evaluation of the hardware claims rests on several questionable assumptions and a potentially biased comparison methodology. These issues severely undermine the paper's central claims of superior performance and efficiency over existing state-of-the-art methods.
Strengths
- Accuracy at Low Precision: The quantization algorithm demonstrates strong empirical performance, particularly in the W2A16 setting (Table 2, page 9). Achieving perplexity scores like 8.43 for LLaMA-2 13B at this bit-width is a notable result and suggests the core quantization methodology is effective at preserving model quality.
- Conceptual Framework: The core concept of utilizing pruned weight locations to store outlier information as a means to enforce memory alignment is a novel approach to the mixed-precision quantization problem. It directly addresses a known trade-off.
- Component Ablation: The ablation study presented in Table 7 (page 12) is methodical and provides a clear view of how each algorithmic component (MX format choice, outlier magnitude reduction, Hessian-based updates) contributes to the final accuracy.
Weaknesses
-
Fundamentally Biased "Iso-Accuracy" Hardware Comparison: The paper’s headline claims of "3x faster inference and 2x lower energy" are derived from the "iso-accuracy" comparison in Figure 12 (page 11). This comparison is methodologically unsound. The authors compare their highly-optimized mixed-precision configuration (MicroScopiQ-v2, which is mostly 2-bit) against baselines like Olive and GOBO, which are likely evaluated in their default, uniform 4-bit or 8-bit configurations. A 2-bit design will inherently be faster and more energy-efficient than a 4-bit design. The correct experiment would be to configure the baseline methods in a similar mixed-precision setup to achieve the same accuracy target. Without this, the performance gains shown are not an apples-to-apples comparison of architectural novelty but rather a trivial consequence of using a lower average bit-width.
-
Overstated and Misleading Hardware Claims on GPUs: The GPU evaluation in Table 6 (page 12) is highly problematic. The results on an actual A100 GPU ("W4A4 MS optim.") show performance that is, at best, on par with the Atom baseline (1.01x for LLaMA-3 8B) and significantly underperforms FP16. The dramatic speedup figures (e.g., 1.78x) are only achieved in a simulation of a GPU with a hypothetical, modified tensor core ("w/ New MTC"). Presenting simulated results from non-existent hardware as a primary performance metric is misleading. The paper does not demonstrate a practical advantage on current hardware.
-
Under-analyzed ReCoN Overhead and Contradictory Claims: The authors claim the ReCoN NoC has "minimal overhead" (Section 5.1, pg 6) and that access conflicts are under 3% (Section 7.8, pg 13). This is contradicted by their own data in Figure 18a (page 14), which shows that moving from 1 ReCoN unit to 8 units (thereby eliminating contention) results in a 21% latency improvement. A component whose contention causes a 21% performance loss cannot be described as having minimal overhead. The analysis lacks a detailed breakdown of the latency penalty per conflict, the complexity of the arbitration logic, or the area/power cost of the cross-row routing channels.
-
Opaque Metadata Management Costs: The entire scheme relies on a "permutation list" to correctly identify and reassemble the distributed outlier bits. While its size is factored into the EBW calculation (Section 4.4, pg 5), the practical hardware cost is not analyzed. What is the required size of the on-chip Instruction Buffer to hold this metadata? What is the bandwidth required to stream these permutation lists to the ReCoN controllers cycle-by-cycle? If a layer has a high density of non-local outliers, this metadata traffic could become a significant bottleneck, which the paper completely ignores.
-
Subjective and Self-Serving Initial Comparison: Table 1 (page 1) is not an objective academic comparison. It uses qualitative and unquantified labels like "Simple" PE design, "Low" HW overhead, and a binary "Yes/No" for "Flexibility". This table frames the problem space in a way that conveniently positions the authors' work as the only viable solution, which is inappropriate for a rigorous scientific paper.
Questions to Address In Rebuttal
-
Regarding the iso-accuracy comparison in Figure 12: Can you provide a new comparison where the baseline accelerators (e.g., Olive, GOBO) are also configured in a layer-wise mixed-precision mode (e.g., using both 2-bit and 4-bit layers) to match the exact accuracy of MicroScopiQ-v2? If not, how can you defend the claim that your architecture is superior, rather than just your quantization recipe enabling a lower average bit-width?
-
Regarding the GPU results in Table 6: Given that the real-hardware implementation shows no meaningful throughput improvement over existing optimized kernels, please justify the paper's strong claims of "accelerating" foundational models. Should the claims in the abstract and conclusion be revised to state that the benefits are contingent on future, hypothetical hardware changes?
-
Please reconcile the contradiction regarding ReCoN's overhead. How can a component with a 21% latency impact due to contention (as shown in Figure 18a) be considered to have "minimal overhead"? Provide a detailed cycle-level analysis of the performance penalty incurred when a PE row stalls while waiting for ReCoN access.
-
Please provide a quantitative analysis of the permutation list metadata. For a model like LLaMA-3 70B, what is the total size of this metadata? What is the required on-chip buffer size and read bandwidth from this buffer to sustain 100% utilization of the PE array, assuming worst-case (but realistic) outlier distributions?
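To indicate the scale of the answer being requested, a back-of-the-envelope estimate is sketched below. Every parameter (micro-block size, outlier rate, entry width) is a placeholder chosen by this reviewer for illustration, not a value taken from the paper.

```python
import math

def permutation_metadata_bytes(num_weights: float, block_size: int,
                               outlier_rate: float) -> float:
    # One entry per outlier: at minimum it must encode which slot inside the
    # micro-block received the outlier's redistributed LSBs.
    bits_per_entry = math.ceil(math.log2(block_size))
    return num_weights * outlier_rate * bits_per_entry / 8

if __name__ == "__main__":
    # Hypothetical parameters for a 70B-parameter model (not from the paper).
    gib = permutation_metadata_bytes(70e9, block_size=32, outlier_rate=0.01) / 2**30
    print(f"~{gib:.2f} GiB of permutation metadata")
```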
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces MicroScopiQ, a novel algorithm-architecture co-design that addresses the fundamental trade-off between accuracy and hardware efficiency in quantizing Foundational Models (FMs). The key challenge in this domain is the presence of large-magnitude outliers, which existing methods handle by either retaining them at high precision (compromising hardware efficiency and memory alignment) or quantizing them to the same low precision as other values (compromising accuracy).
MicroScopiQ proposes an elegant third path. Its core innovation is to leverage structured pruning not primarily for model compression, but as a mechanism to create a "bit budget" to represent critical outlier values at higher precision. Specifically, for each outlier that needs extra bits for a high-precision representation (e.g., 4-bit MX-FP), the method identifies and prunes a corresponding least-important "inlier" weight (using Hessian information). The memory location of this pruned weight is then repurposed to store the extra bits (the LSBs) of the outlier. This masterstroke allows the model to maintain a uniform, dense, and aligned memory layout from the hardware's perspective, while logically storing outliers at higher precision. To manage the required data reorganization at runtime, the authors propose a low-overhead, time-multiplexed Network-on-Chip called ReCoN. The result is a system that achieves the accuracy benefits of mixed-precision approaches while retaining the hardware simplicity and efficiency of uniform quantization.
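To make the mechanism concrete in the smallest possible setting, the following reviewer-written sketch packs a micro-block of 2-bit inliers together with one 4-bit outlier whose two extra LSBs are parked in the slot of a pruned inlier, so every slot still occupies exactly two bits. The bit-widths and the one-outlier-per-block layout are simplifying assumptions, not the paper's exact format.

```python
def pack_block(inliers_2b: list[int], outlier_4b: int,
               outlier_pos: int, pruned_pos: int) -> list[int]:
    """Return a dense block of 2-bit codes with the same length as the input.

    The outlier's top two bits stay in its own slot; its bottom two bits are
    stored in the slot of the pruned (least-salient) inlier.
    """
    slots = [v & 0b11 for v in inliers_2b]
    slots[outlier_pos] = (outlier_4b >> 2) & 0b11   # MSBs stay in place
    slots[pruned_pos] = outlier_4b & 0b11           # LSBs reuse the pruned slot
    return slots

def unpack_outlier(slots: list[int], outlier_pos: int, pruned_pos: int) -> int:
    # What the accelerator must reassemble before the outlier's contribution
    # can be accumulated at full precision.
    return (slots[outlier_pos] << 2) | slots[pruned_pos]

if __name__ == "__main__":
    block = [1, 2, 0, 3, 1, 0, 2, 1]  # eight 2-bit inlier codes
    packed = pack_block(block, outlier_4b=0b1011, outlier_pos=3, pruned_pos=5)
    assert all(0 <= v <= 3 for v in packed) and len(packed) == len(block)
    assert unpack_outlier(packed, 3, 5) == 0b1011
    print(packed)
```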
Strengths
-
Elegant Core Concept and Synthesis of Ideas: The central contribution of this work is the conceptual reframing of the relationship between pruning and quantization. Instead of viewing them as two separate, and sometimes conflicting, compression techniques, the authors use one to directly enable the other. The idea of "pruning for bit redistribution" is a sophisticated and powerful synthesis that elegantly sidesteps the primary dilemma in outlier-aware quantization. It connects the fields of model pruning, quantization, and hardware architecture in a novel and synergistic way.
-
Addressing a Critical and Well-Defined Bottleneck: The paper does an excellent job of situating itself within the current research landscape. The categorization of prior art into two groups (as seen in Table 1, page 1)—those that sacrifice hardware efficiency for accuracy (Group A) and those that sacrifice accuracy for efficiency (Group B)—is insightful and accurately frames the core problem. MicroScopiQ is presented not as an incremental improvement, but as a genuine attempt to resolve this "mutual exclusivity," a goal of significant importance to the field of efficient AI.
-
Holistic Algorithm-Architecture Co-Design: This work is a prime example of successful co-design. The algorithm is not developed in a vacuum; it is designed with hardware realizability as a first-class constraint. The choice of MX data formats, the structured pruning pattern, and the redistribution of bits are all motivated by the goal of enabling a simple, homogeneous INT-based PE array. The ReCoN NoC is the crucial architectural piece that makes the algorithm's data-shuffling requirements practical, demonstrating a deep understanding of the interplay between software and hardware.
-
Strong and Comprehensive Empirical Validation: The authors provide a robust evaluation across a wide spectrum of models (LLMs like LLaMA, VLMs like OpenFlamingo, and even CNNs/SSMs), quantization settings (W4/A16, W2/A16, W4/A4, etc.), and tasks. The consistent outperformance against a strong suite of baselines, including recent SOTA methods like OmniQuant and specialized co-designs like Olive, convincingly demonstrates the effectiveness of the proposed technique. The architectural simulations and ablations (Section 7, pages 9-14) further bolster the claims of efficiency and low overhead.
Weaknesses
-
Limited Contextualization of the ReCoN Architecture: While the paper claims ReCoN is a "novel" NoC, its functionality—data permutation and combination based on control signals—shares principles with classical switching networks (e.g., butterfly, Benes networks, which are cited in other contexts in the related work). The paper would be strengthened by more explicitly placing ReCoN within the broader literature of on-chip networks for data reorganization. Is ReCoN a specialized application of known principles, or does it introduce fundamentally new routing or flow control mechanisms? A deeper discussion would help clarify its architectural contribution.
-
Under-explored Sensitivity to Outlier Characteristics: The entire premise relies on the number of outliers being relatively small (e.g., <5% as shown in Figure 2a, page 3), such that an equal number of inliers can be pruned without catastrophic accuracy loss. The paper demonstrates this holds for current FMs. However, it would be valuable to discuss the conceptual limits of this approach. What happens in a hypothetical future model where outliers are more dense or clustered? At what point does the accuracy degradation from pruning overwhelm the gains from representing outliers faithfully? A stress test or a more theoretical discussion on these boundary conditions would add significant depth.
-
Positioning Relative to Structured Sparsity: The final weight matrix, with its pruned locations, is effectively a form of structured sparsity. The paper primarily contrasts its approach with mixed-precision and uniform quantization. However, it would be illuminating to compare and contrast MicroScopiQ with other hardware approaches for structured sparsity, such as NVIDIA's 2:4 sparsity support in Tensor Cores. While the goals are different (MicroScopiQ uses sparsity to enable precision, not just for FLOP reduction), the underlying hardware challenges of handling non-dense data have parallels that are worth exploring.
Questions to Address In Rebuttal
-
Could the authors elaborate on the novelty of the ReCoN NoC architecture in the context of prior NoC designs for data reorganization and permutation (e.g., butterfly or Benes networks)? While the overhead is shown to be low, a deeper analysis of its scalability and control complexity, especially for very wide PE arrays, would be beneficial.
-
The effectiveness of MicroScopiQ hinges on pruning a number of inliers roughly equal to the number of outliers. What are the empirical limits of this approach as outlier density increases? Have you tested models or layers with unusually high outlier ratios (>5%) to see where the trade-off between pruning-induced error and quantization error begins to break down?
-
The proposed method creates a unique form of structured sparsity to store higher-precision data. How does the performance and hardware complexity of the ReCoN-based approach for handling this implicit sparsity compare to architectures designed to explicitly handle other forms of structured sparsity, such as NVIDIA's 2:4 pattern?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper introduces MicroScopiQ, a co-design methodology for quantizing Foundational Models (FMs). The authors identify a key trade-off in existing outlier-aware quantization schemes: high-precision outlier storage (e.g., GOBO [99]) harms hardware efficiency and memory alignment, while uniform low-precision quantization (e.g., Olive [29]) harms accuracy.
The authors claim two primary novel contributions to resolve this:
1. Algorithmic Contribution: A technique that combines Hessian-based pruning with quantization. Instead of merely removing weights, the method prunes the least salient inlier weights to create "bit-space". This space is then used to store the least significant bits (LSBs) of outliers, which are themselves quantized to a higher precision (e.g., 4-bit outliers in a 2-bit weight matrix). The core novelty is this "pruning-for-storage" mechanism, which maintains a uniform bit-width per tensor element, thereby ensuring memory alignment.
2. Architectural Contribution: A custom accelerator featuring a specialized Network-on-Chip (NoC) called ReCoN (Redistribution and Coordination NoC). ReCoN is a time-multiplexed butterfly network designed to intercept the outputs from PEs processing distributed outlier data, reassemble the full-precision partial products, and forward the correct result to the next stage. This architectural pattern claims novelty by centralizing the complexity of handling mixed-format data, allowing the core Processing Element (PE) array to remain simple, homogeneous, and INT-based.
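The arithmetic identity that such a merge step relies on can be stated compactly; the sketch below is a schematic reviewer-written reading of it and deliberately omits the shared MX scale and exponent handling that the actual switch logic must also perform.

```python
def split_weight(w4: int) -> tuple[int, int]:
    # A 4-bit outlier weight stored as two 2-bit chunks in separate slots.
    return (w4 >> 2) & 0b11, w4 & 0b11

def merge_partials(p_hi: int, p_lo: int) -> int:
    # The reassembly a ReCoN-style switch performs: shift the partial product
    # of the MSB chunk and add the LSB chunk's partial product.
    return (p_hi << 2) + p_lo

if __name__ == "__main__":
    w, act = 0b1101, 9                 # 4-bit weight, example activation
    hi, lo = split_weight(w)
    assert merge_partials(hi * act, lo * act) == w * act
    print(merge_partials(hi * act, lo * act))
```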
Strengths
The primary strength of this paper lies in the cleverness of its central idea.
- Novelty of the "Pruning-for-Storage" Concept: The core concept of pruning weights not for sparsity acceleration but to serve as a storage medium for the excess bits of higher-precision values is, to my knowledge, a genuinely novel approach. It provides an elegant conceptual bridge between mixed-precision accuracy and uniform-precision hardware efficiency. It directly addresses the memory alignment problem that plagues sparse/mixed-precision formats like that in GOBO [99].
- Novel Architectural Pattern for Abstraction: While the components of the accelerator are not entirely new (systolic arrays, multi-precision PEs, butterfly networks), their synthesis into the proposed architecture is. The specific use of a NoC (ReCoN) to offload and manage the re-materialization of distributed floating-point values is a novel architectural pattern. It effectively abstracts the complexity of the data format away from the PE array, which is a significant conceptual departure from prior work like Olive [29], which places this complexity inside each PE.
- Refinement over Adjacent Prior Art: The proposed method is a clear and non-trivial advancement over its closest conceptual neighbors. Unlike Olive [29], which prunes values physically adjacent to outliers, MicroScopiQ uses saliency-based pruning (Hessian), which is functionally superior. Unlike SDQ [37], which also combines pruning and quantization, MicroScopiQ does not decompose the tensor into separate sparse vectors but rather performs an "in-place" redistribution of bits within a single, dense tensor representation. This delta is significant.
Weaknesses
While the core idea is novel, the paper's claims of novelty could be more precise by situating them more rigorously against the backdrop of prior art.
- Overlapping Concepts in Pruning + Quantization: The idea of combining pruning and quantization is not new. SDQ [37] is a recent pre-print that proposes "Sparse Decomposed Quantization," decomposing weights into inlier and outlier vectors that are quantized differently and stored sparsely. The authors mention SDQ, but the novelty of their own approach—namely the "in-place" bit redistribution within a unified tensor format versus SDQ's decomposition into two separate vectors—should be made more explicit as the key differentiator. The current description frames SDQ as merely having "limited outlier flexibility," which undersells the fundamental structural difference that constitutes MicroScopiQ's novelty.
- Architectural Primitives are Not New: The paper should be more careful in attributing novelty. ReCoN is described as a "multistage butterfly NoC" (Section 5.4, page 7). Butterfly networks are a classic topology for permutation and sorting, and their use in accelerator NoCs has been explored (e.g., for data layout transformation in [90]). The novelty is not the topology itself, but rather the specific functionality of the ReCoN switch (Figure 7(c), page 7), particularly the "Merge" operation which is custom-designed to reconstruct FP partial sums from the distributed mantissa chunks. The authors should sharpen their claim to focus on the functional novelty of the switch logic, not the topology.
- Potentially Misleading Terminology: In Section 4.3 (page 5), the authors describe their pruning pattern as (Bμ - n):Bμ structured pruning. This terminology is potentially confusing. The term "N:M structured pruning" typically refers to a fixed, regular pattern (e.g., 2:4) that hardware can exploit directly for computation. Here, n (the number of outliers) is data-dependent, making the pattern dynamic and irregular from one micro-block to the next. The structure exists for storage and redistribution, not for direct computational acceleration in the vein of NVIDIA's sparse tensor cores. This distinction is critical and the chosen terminology clouds it.
Questions to Address In Rebuttal
- Clarification vs. SDQ [37]: The core algorithmic novelty rests on the "pruning-for-storage" idea. Please move beyond the qualitative description and explicitly contrast this with the sparse decomposition in SDQ. Is the primary benefit the elimination of index storage and random memory access? Please articulate the fundamental conceptual delta that makes your contribution non-obvious in light of SDQ.
- Defining ReCoN's Novelty: Is the primary novel contribution of ReCoN its butterfly topology or the specific logic within its switches designed to re-materialize FP values from INT components? If the latter, please confirm that this specific Merge functionality, which accounts for mantissa shifting and the implicit 1.0 hidden bit, is without precedent in prior accelerator designs.
- Justification for Architectural Complexity: The ReCoN unit introduces a non-trivial, time-multiplexed, multi-stage network into the datapath, which adds latency and area, however small. What alternative, simpler architectural designs were considered? For example, could a specialized functional unit attached to the PE array's output bus perform the same outlier reassembly without the routing complexity of a full NoC? Please justify why this NoC-based approach represents a more novel and effective solution compared to simpler alternatives.
- On the "N:M Structured Pruning" Terminology: Please address the potential for confusion with the standard definition of N:M sparsity. Acknowledge that this pattern is dynamic and content-dependent. Would a different term, such as "Sub-block Excision" or "Dynamic Bit-Compaction Pruning," more accurately describe the mechanism without creating a false equivalency to hardware-accelerated fixed sparsity patterns?
Topology-Aware Virtualization over Inter-Core Connected Neural Processing Units
Abstract
With the rapid development of artificial intelligence (AI) applications, an emerging class of AI accelerators, termed Inter-core Connected Neural Processing Units (NPU), has been adopted in both cloud and edge computing environments, like Graphcore IPU, ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper presents vNPU, a virtualization framework for inter-core connected Neural Processing Units (NPUs), a class of accelerators characterized by a hardware topology and direct inter-core communication. The authors argue that existing virtualization techniques for CPUs and GPUs are ill-suited for this architecture. They propose three core techniques: 1) vRouter for virtualizing instruction dispatch and the Network-on-Chip (NoC), 2) vChunk, a range-based memory virtualization mechanism to replace traditional paging, and 3) a "best-effort" topology mapping algorithm based on graph edit distance to improve resource utilization when exact topology matches are unavailable. The system is evaluated via an FPGA prototype and a simulator, claiming significant performance improvements over a re-implemented MIG-like baseline.
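For readers unfamiliar with this accelerator class, the sketch below conveys the flavor of virtual-to-physical core redirection that a vRouter-style routing table implies. The table structure and API are this reviewer's assumptions, not the paper's hardware interface.

```python
class VRouterTable:
    """Per-tenant mapping from virtual core ids to physical core ids."""

    def __init__(self, mapping: dict[int, int]):
        self.v2p = dict(mapping)

    def redirect(self, src_vcore: int, dst_vcore: int, payload: bytes):
        # A send issued against the *virtual* topology is rewritten into
        # physical coordinates before it enters the NoC.
        return self.v2p[src_vcore], self.v2p[dst_vcore], payload

if __name__ == "__main__":
    # A 2x2 virtual mesh placed on four non-contiguous physical cores.
    table = VRouterTable({0: 5, 1: 6, 2: 9, 3: 10})
    print(table.redirect(0, 3, b"tile"))  # -> (5, 10, b'tile')
```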
Strengths
- Problem Identification: The paper correctly identifies a salient problem. The architectural paradigm of inter-core connected NPUs is fundamentally different from traditional SIMT accelerators like GPUs, and the authors make a clear case that topology and data flow are first-class citizens that existing virtualization mechanisms ignore.
- Core Concepts: The proposed high-level concepts are logical responses to the identified architectural challenges. Using a routing table for core redirection (vRouter) and a range-based translation for memory (vChunk) are sound design choices for the described hardware and workload characteristics.
- Evaluation Platform: The use of both an FPGA-based platform (Chipyard+FireSim) for micro-architectural validation and a simulator (DCRA) for larger-scale experiments is a methodologically sound approach.
Weaknesses
My primary concerns with this work lie in the rigor of the evaluation and the practical implications of the proposed solutions, which appear to be insufficiently stress-tested.
-
The MIG Baseline Appears to be a Straw Man: The central performance claim hinges on the comparison against a "MIG-based virtual NPU." This is not an industry-standard MIG but an authors' re-implementation based on the concept of "fixed partitions" (Section 6.1). This approach is prone to confirmation bias. The most striking result (up to 1.92x improvement, Section 6.3.2) is demonstrated in a scenario where a GPT-large model requires 36 cores, while the largest available MIG partition is only 24 cores. This forces the MIG baseline into a time-division multiplexing (TDM) penalty by design. While this demonstrates the flexibility of vNPU, it does not represent a fair comparison of architectural overheads but rather a comparison of flexible partitioning versus fixed partitioning. The claimed performance gain is a direct consequence of an allocation scenario that maximally penalizes the baseline.
-
Ambiguous Claims of Performance Isolation: The paper claims its design can provide strong isolation, but the details of vRouter for the NoC suggest otherwise. In Section 4.1.2, the authors present two routing strategies: one using default Dimension-Order Routing (DOR) which "may lead to potential performance interference," and another "predefining the routing direction inside the routing table." This second option prevents packets from being routed to the wrong tenant's cores, but it does not prevent performance interference due to network congestion. If two virtual NPUs are mapped to physically adjacent cores, their NoC traffic will still contend for shared physical links and router buffers. The paper fails to quantify this residual interference, making its isolation claims tenuous.
Unaccounted Computational Overhead for Topology Mapping: The paper proposes a topology mapping algorithm based on graph edit distance, which is noted to be NP-hard (Section 4.3). The authors mention pruning strategies, but the evaluation completely omits the computational cost of this allocation algorithm. When a user requests a virtual NPU, the hypervisor must execute Algorithm 1 to find a suitable physical mapping. What is the latency of this process? For a large NPU with many free cores, the number of candidate subgraphs could be enormous. This allocation latency is a critical component of the system's "warm-up time" and its absence from the evaluation (Figure 16 only measures data loading) is a significant omission.
-
Weak Memory Virtualization Baseline: The
vChunkmechanism is compared against a page-based IOTLB with 4 and 32 entries (Figure 14). An IOTLB with only 4 entries is not a realistic baseline for a high-performance accelerator. While the conclusion that range-based translation is superior for coarse-grained DMA is likely correct, the reported 20% overhead for the page-based system may be artificially inflated by comparing against an under-provisioned and poorly characterized baseline.
Questions to Address In Rebuttal
-
Regarding the MIG Baseline: Can the authors justify that their MIG-based baseline is a fair and representative model? Please provide results for a scenario where both vNPU and the MIG baseline can satisfy the core-count requirement without resorting to TDM, to provide a more direct comparison of the architectural overheads.
-
Regarding NoC Isolation: Please clarify the exact guarantees of performance isolation. Can you quantify the potential for performance degradation due to NoC link/router contention between co-located virtual NPUs, even when using the predefined routing path strategy?
-
Regarding Topology Mapping Overhead: What is the real-world latency of executing the topology mapping algorithm (Algorithm 1) in the hypervisor during a virtual NPU allocation request? Please provide data on how this latency scales with the size of the physical NPU and the number of free cores.
-
Regarding the IOTLB Baseline: Could you provide more details on the configuration of the page-based IOTLB baseline used in Figure 14? Specifically, what are its lookup latency, miss penalty, and page-walking mechanism? How was the choice of 4 and 32 entries justified?
Review 2
Excellent, here is a peer review of the paper from the perspective of "The Synthesizer."
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents vNPU, a comprehensive virtualization framework for an emerging and important class of AI accelerators: Inter-core Connected Neural Processing Units (NPUs). These accelerators, exemplified by architectures like Graphcore's IPU and Tenstorrent, depart from the traditional model of a symmetric pool of compute units (as seen in GPUs) and instead leverage a hardware-defined topology and a dataflow execution model.
The authors correctly identify that existing virtualization techniques, developed for CPUs and GPUs, are fundamentally insufficient for these new architectures because they are "topology-oblivious." The core contribution of this work is to introduce the concept of topology-aware virtualization. The authors build a full-stack solution around this concept, comprising three key ideas: (1) vRouter, a hardware mechanism to virtualize the instruction and data flow by redirecting traffic between virtual and physical NPU cores according to a virtual topology; (2) vChunk, a specialized memory virtualization scheme tailored to the coarse-grained, streaming DMA access patterns of NPUs, avoiding the overhead of traditional page-based translation; and (3) Best-effort Topology Mapping, an algorithmic approach to resource allocation that finds "good enough" physical core layouts for requested virtual topologies, balancing utilization and performance. The work is evaluated through a combination of FPGA prototyping and simulation, demonstrating significant performance gains over topology-oblivious (UVM-based) and rigidly partitioned (MIG-based) approaches.
Strengths
-
Timeliness and Novelty of the Core Problem: The paper is exceptionally well-timed. As the industry moves towards specialized, large-scale, spatially-programmed accelerators, the question of how to efficiently share them in multi-tenant cloud environments becomes paramount. This work is, to my knowledge, one of the first to formally identify and address the unique virtualization challenges posed by the dataflow nature and explicit topology of these devices. It moves the conversation beyond simply partitioning resources to virtualizing the very fabric of communication that makes these accelerators powerful.
-
A Coherent and Conceptually Sound Framework: The authors have not just identified a problem; they have proposed a clean, coherent set of abstractions to solve it. The division of labor between
vRouter(handling the spatial/topological aspects) andvChunk(handling the memory access aspects) is logical and directly maps to the architectural novelties of these NPUs. This provides a valuable conceptual blueprint for future work in this domain. -
Connecting Systems Architecture with Theoretical Computer Science: A particular strength is the application of the graph edit distance algorithm to the NPU core allocation problem (Section 4.3, page 7). This is a wonderful example of bridging a practical systems problem (how to place a virtual topology onto a fragmented physical one) with a well-understood concept from graph theory. While others might have opted for a simpler, greedy heuristic, this approach shows a deeper level of thinking about the problem's fundamental structure and opens the door to more sophisticated allocation strategies.
-
Strong Contextualization and Motivation: The paper does an excellent job of situating itself within the broader landscape of accelerator virtualization. The introduction and background sections (Sections 1 and 2) clearly articulate why existing methods like NVIDIA's MIG or academic proposals like Aurora fall short. The argument that these new NPUs are not just "more powerful GPUs" but a different architectural paradigm is well-made and provides a strong justification for the novel techniques proposed.
Weaknesses
While this is a strong and foundational paper, its primary weakness lies in the assumptions that underpin its design, which may limit its generality in the face of a rapidly evolving AI landscape.
-
Dependence on Predictable Workload Behavior: The
vChunkdesign is highly optimized for the specific memory access patterns the authors identify as typical for ML models (monotonic, iterative tensor access, as described in Section 4.2, page 6). This is clever, but it ties the effectiveness of the memory virtualization scheme to a particular class of workloads. As the authors themselves note in the discussion (Section 7, page 13), workloads with more irregular memory access patterns, such as Graph Neural Networks (GNNs) or sparse models, would likely challenge this design and may perform better with traditional paging. The work would be stronger if it explored a hybrid approach or more deeply analyzed the break-even point. -
Scalability of the Topology Mapping Algorithm: The use of topology edit distance is elegant, but its NP-hard nature raises practical concerns about scalability. In a large-scale cloud datacenter with potentially thousands of cores and a high frequency of allocation/deallocation requests, the latency of the allocation algorithm itself could become a system bottleneck. The paper mentions pruning strategies, but a more thorough analysis of the computational cost and its impact on VM startup times ("warm-up time" in Figure 16, page 11, seems to only cover data loading) would be beneficial.
-
Security Implications of Spatial Adjacency: The paper focuses primarily on performance isolation (preventing "NoC interference"). However, in a multi-tenant environment, security is equally critical. Placing two mutually distrusting tenants' virtual NPUs in close physical proximity on the NoC could open up new avenues for side-channel attacks (e.g., through timing variations in shared router arbitration). The proposed model does not seem to explicitly account for security domains during placement, which is a crucial aspect for real-world deployment.
Questions to Address In Rebuttal
-
The
vChunkdesign is very effective for the workloads evaluated. Could the authors elaborate on how the vNPU framework might adapt to workloads with less regular memory access patterns, such as large-scale GNNs? Would it be feasible to support bothvChunkand traditional page-based translation, perhaps selectable on a per-VM or per-workload basis? -
Regarding the topology mapping algorithm: What is the computational complexity of the proposed allocation process in practice, considering the pruning heuristics? How would this scale to a physical NPU with thousands of cores and a highly dynamic, fragmented state, typical of a mature cloud environment?
-
Beyond performance isolation, what are the security implications of the topology-aware allocation strategy? Have the authors considered how their mapping algorithm could be augmented to incorporate security constraints, for instance, by maximizing the physical distance or minimizing shared NoC resources between vNPUs belonging to different security domains?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper presents vNPU, a virtualization framework specifically designed for Inter-core Connected Neural Processing Units (NPUs), a class of data-flow accelerators like the Graphcore IPU or Tenstorrent. The authors' central claim is that this is the first comprehensive design to virtualize not just the compute and memory resources, but the hardware topology of these devices. The work introduces three primary components to achieve this: 1) vRouter, for redirecting instruction and data flows to create a virtual Network-on-Chip (NoC) topology; 2) vChunk, a range-based memory virtualization mechanism optimized for the bursty, DMA-driven memory access patterns of NPUs; and 3) a Topology Mapping Algorithm, which uses graph edit distance to map a user's desired virtual topology onto available, potentially fragmented, physical cores.
Strengths
The primary strength of this paper lies in its precise identification of a gap in prior art and the proposal of a coherent solution.
-
A Genuinely Novel Problem Formulation: The core idea of "topology-aware virtualization" for data-flow accelerators is, to my knowledge, novel. Prior work in GPU virtualization, such as NVIDIA's MIG, focuses on static, hard partitioning of resources. While effective, it does not allow for the creation of arbitrary, user-defined virtual topologies from a pool of physical resources. Similarly, prior academic work on NPU virtualization (e.g., Aurora [41], V10 [77]) has largely targeted monolithic NPUs, sidestepping the critical challenge of virtualizing the inter-core fabric that defines data-flow architectures. This paper correctly identifies that for this class of hardware, the interconnect is a first-class resource to be virtualized.
-
Application of Known Concepts to a New Domain: The authors' approach to solving the "topology lock-in" problem (Section 4.3, page 7) is a novel application of established graph theory concepts. While topology/graph edit distance is not a new algorithm, framing the NPU core allocation problem as finding a subgraph with minimum edit distance to the requested topology is an original and insightful way to manage fragmented resources efficiently. This moves beyond simple core counting to a more sophisticated, performance-aware allocation strategy.
-
Domain-Specific Optimization of an Existing Idea: The
vChunkmechanism for memory virtualization is a well-reasoned adaptation. The concept of range-based address translation is not new, as the authors themselves acknowledge by citing prior work [10, 22] on page 6. However, their contribution is the optimization for the specific memory access patterns of NPUs. The introduction of thelast_vfield to predict the next Range Translation Table (RTT) entry based on the observed iterative nature of ML workloads (Pattern-3, page 6) is a clever, low-cost hardware optimization that directly addresses the shortcomings of generic translation mechanisms in this specialized context.
Weaknesses
The paper's claims of novelty, while largely justified at a high level, could be more precisely delineated from prior art in some areas.
-
Overstated Novelty of Range-Based Translation: The presentation of
vChunkas one of three "novel techniques" in the abstract is an overstatement. The fundamental mechanism is not novel. The novelty lies in the indexing optimization (last_v). The paper should be clearer in distinguishing the adaptation of an existing technique from the invention of a new one. The core idea remains "range-based translation," with a domain-specific lookup optimization. -
Unexplored Trade-offs in Virtual NoC Routing: The
vRouterfor NoC virtualization (Section 4.1.2, page 5) presents two routing strategies for irregular topologies: a default Dimension-Order Routing (DOR) that may cause "NoC interference" between different virtual NPUs, or a pre-defined routing table to confine packets. This introduces a critical trade-off between performance (potentially non-optimal paths in the latter case) and isolation (potential interference in the former). The paper does not sufficiently quantify the performance delta or the severity of interference between these two approaches. The novelty of creating irregular topologies is diminished if the only way to make them work safely is with statically defined, potentially inefficient routes. -
Scalability of the Mapping Algorithm: The topology mapping algorithm is based on computing topology edit distance, a derivative of the NP-hard subgraph isomorphism problem. While the paper proposes intuitive pruning strategies (Section 4.3, page 7), the computational complexity of this approach in the hypervisor is not analyzed. For future NPUs with thousands of cores and a high density of tenants requesting diverse topologies, the overhead of this "best-effort" mapping could become a bottleneck. The novelty of the idea must be weighed against its practical scalability.
Questions to Address In Rebuttal
-
Regarding
vChunk: Can the authors please clarify the precise delta of their contribution over prior range-based TLB designs like [10, 22]? Specifically, is the novelty limited to thelast_vpredictive indexing mechanism, or are there other fundamental architectural differences? -
Regarding
vRouterand NoC Interference: The paper discusses the risk of NoC interference when using default routing on an irregular virtual topology mapped to physical cores (page 5). Could you provide quantitative data on the performance impact of this interference versus using the pre-defined direction routing? How does the performance of an irregular virtual topology with "safe" but non-optimal routes compare to an ideal, contiguous topology? This is key to understanding the real-world cost of the flexibility you propose. -
Regarding the Topology Mapping Algorithm: What is the computational overhead of the proposed mapping algorithm (Algorithm 1, page 8) in the hypervisor? Please provide an analysis or experimental data on how the search time for a suitable topology scales with the number of total physical cores, the number of currently active tenants (i.e., degree of fragmentation), and the size of the requested virtual topology. At what point does this best-effort mapping become prohibitively slow?
Chip Architectures Under Advanced Computing Sanctions
Abstract
The rise of large scale machine learning models has generated unprecedented requirements and demand on computing hardware to enable these trillion parameter models. However, the importance of these bleeding-edge chips to the global economy, technological ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present a study on the architectural implications of recent advanced computing sanctions, using a design space exploration (DSE) to model chip performance for LLM inference under these constraints. The paper claims to demonstrate methods for optimizing compliant chip designs and proposes an "architecture-first" approach for crafting more effective, less economically harmful regulations.
While the topic is timely, the work rests on a foundation of significant methodological weaknesses. The core results are derived from a high-level simulation framework whose accuracy for this specific task is not validated within the paper. Furthermore, the study's interpretation of the regulatory constraints is based on a simplified model that may not reflect reality. Finally, the proposed policy solutions are speculative and fail to consider the same adversarial design responses that motivate the paper's existence.
Strengths
- Timeliness: The paper addresses a relevant and pressing issue at the intersection of computer architecture and international policy.
- Problem Formulation: The work correctly identifies a key tension: regulations based on theoretical performance metrics create a design space that can be optimized ("gamed") by architects, potentially circumventing the policy's intent.
- Marketing vs. Architectural Classification: The analysis in Section 5.2 (Page 9), which highlights the ambiguity of marketing-based classifications for data center vs. non-data center devices, is a concrete and well-argued contribution. Figure 10 effectively makes this point.
Weaknesses
-
Reliance on an Unvalidated Modeling Framework: The paper's entire quantitative analysis hinges on the LLMCompass framework [80], a prior work from the same authors. There is no validation in this paper of LLMCompass's accuracy against any real-world sanctioned or compliant hardware (e.g., A100, H100, A800, H800). High-level simulators necessarily make abstractions; without a clear understanding and validation of the model's fidelity, the specific performance improvement figures (e.g., "4% and 27%" in the Abstract) are unsubstantiated. The area and cost models are similarly opaque, making the cost-benefit analysis in Section 4.4 and Figure 8 highly questionable.
-
Oversimplification of Regulatory Constraints (TPP): The DSE is constrained by a Total Processing Performance (TPP) limit. The authors define TPP with a straightforward formula (Equation 1, Page 6) based on systolic array dimensions, lane count, and core count. The actual definition provided by the Bureau of Industry and Security (BIS) is more nuanced and based on vendor-reported performance for "tensor operations." The authors' formula is an interpretation at best. It is a critical, unstated assumption that this simplified model accurately reflects the official metric that designers must adhere to. The validity of the entire DSE depends on this assumption, which is not justified. One illustrative reading of such an architecture-derived formula is sketched after this list.
-
Superficiality of "Architecture-First Policy" Proposal: Section 5 presents a policy proposal that is conceptually weak and lacks rigor.
- The proposal to use architectural features like L1 cache size or memory bandwidth as regulatory limits (Section 5.3) ignores the central premise of the paper: architects will design around constraints. Regulating L1 cache size would simply incentivize designers to develop more sophisticated prefetching, different cache hierarchies, or dataflows that are less reliant on L1. The paper fails to analyze these inevitable second-order, adversarial responses.
- The claim in Section 5.4 that restricting matmul hardware would "likely maintain high gaming performance" is an unsupported assertion. No data is presented to quantify the impact of matmul restrictions or removal on modern gaming workloads that increasingly rely on these structures for features like DLSS.
- The proposal is a solution that suffers from the exact problem it purports to solve. It merely shifts the target for "gaming" from one metric (TPP) to another (e.g., on-chip SRAM capacity).
-
Limited and Potentially Misleading DSE: The claim of a "thorough design space exploration" (Abstract, Page 1) is an overstatement. The DSE sweeps a handful of parameters while keeping others, such as clock frequency, fixed (Section 3.2, Page 5). Modern GPUs employ complex dynamic voltage and frequency scaling, and fixing this parameter is a major simplification. Furthermore, the baseline for comparison in Section 4.2 is the non-compliant NVIDIA A100. A more intellectually honest comparison for a compliant design would be against a naively compliant design (e.g., a model of the A800) to demonstrate the actual benefit of their architectural co-optimization. Comparing an optimized compliant design to a non-compliant one inflates the perceived benefits.
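To make the concern about Equation 1 concrete, the following is one illustrative, architecture-derived reading of a TPP-style metric. The paper's actual Equation 1 is not reproduced in this review, so every symbol below (systolic dimensions, lane count, core count, clock, bit width) is this reviewer's assumption, not the authors' definition.

```latex
% One assumed architectural reading of a TPP-style metric (not the paper's Equation 1).
% X, Y: systolic array dimensions; N_lane: vector lanes per core; N_core: core count;
% f: clock frequency in Hz; b: operand bit width (e.g., 8 or 16); one MAC counted as 2 ops.
\[
\widehat{\mathrm{TPP}} \;=\; b \cdot \frac{2 \, f \, N_{\mathrm{core}} \left( X Y + N_{\mathrm{lane}} \right)}{10^{12}}
\]
% By contrast, the BIS formulation is stated in terms of vendor-reported peak "tensor operation"
% rates multiplied by bit length, so the two can diverge whenever reported tensor throughput
% differs from this idealized architectural peak.
```

If the official TPP computation weights these terms differently, the frontier of "compliant" designs in the DSE shifts accordingly, which is precisely the sensitivity question raised in the rebuttal items below.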
Questions to Address In Rebuttal
-
Please provide direct validation of the LLMCompass framework's performance, area, and cost models against real, publicly-documented GPUs discussed in this paper (e.g., A100, H100). How accurate are the absolute latency predictions (TTFT, TBT) and the die area estimations?
-
What evidence supports your assumption that TPP can be accurately modeled by Equation 1? Given the regulatory ambiguity, how sensitive are your conclusions about optimal compliant designs to potential variations in the true TPP calculation method?
-
The proposed "architecture-first" policy suggests regulating metrics like L1 cache size. How would this policy not be similarly "gamed" by designers, for example, by shifting reliance to a larger, faster L2 cache or developing novel dataflows? Please provide an analysis of the adversarial architectural responses your proposed policy would likely trigger.
-
Can you justify the decision to fix clock frequency in the DSE, given its significant impact on both performance and power? How would allowing frequency to be a free parameter change your results?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents a timely and highly relevant study on the intersection of computer architecture and geopolitical export controls, specifically the U.S. Advanced Computing Rules (ACRs). The authors conduct the first detailed architectural design space exploration (DSE) to quantify the impact of these sanctions on chip design for large language model (LLM) inference. The core contribution is twofold: first, it provides a quantitative analysis of how current high-level metrics like Total Processing Performance (TPP) and Performance Density (PD) create specific, and at times counter-intuitive, design pressures and economic externalities. Second, and more significantly, it proposes an "architecture-first" approach to policymaking, arguing that using more granular architectural features (e.g., on-chip memory size, memory bandwidth) as regulatory levers can create more targeted, efficient, and economically sound policies.
Strengths
-
Exceptional Timeliness and Relevance: This work addresses one of the most critical topics at the nexus of technology, economics, and international security today. The analysis of real-world sanctions on cutting-edge hardware is of immediate interest to academics, industry practitioners, and policymakers alike. It transforms a subject often discussed in abstract policy terms into a concrete architectural design problem.
-
Novel Interdisciplinary Bridge: The paper's greatest strength is its successful bridging of the deep, technical world of computer architecture with the complex, nuanced field of public policy. The authors effectively translate policy constraints into an architectural DSE and, conversely, use architectural insights to propose better policy. This is a rare and valuable contribution that enriches both fields.
-
Strong Quantitative Foundation: The arguments are not merely speculative; they are backed by a thorough DSE using the LLMCompass framework. The results provide compelling evidence for the paper's claims. For instance, the demonstration that October 2023 rules incentivize increasing die area to pass the Performance Density check (Figure 2, page 4) is a brilliant, non-obvious insight that perfectly illustrates the unintended consequences of high-level metrics. Similarly, the violin plots in Section 5.3 (Figure 11, page 11) convincingly show that architectural parameters like memory bandwidth are far better predictors of decoding performance than TPP alone, providing a solid foundation for their policy proposal. A sketch of this style of predictor comparison is given after this list.
-
Constructive and Forward-Looking Proposal: The paper moves beyond critique to offer a well-reasoned solution. The "Architecture-First Policy" (Figure 3, page 4) is a clear conceptual framework that could genuinely improve technology governance. By showing how to create policies that inherently limit AI performance while preserving gaming performance (Section 5.4, page 10), the authors provide a practical example of how to minimize the negative externalities they identify earlier in the paper. This work lays the groundwork for a new sub-field one might call "Policy-Aware Hardware Design."
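The predictor comparison praised above can be illustrated with a minimal, self-contained sketch. The data below is synthetic and the variable names (memory bandwidth, TPP, time-between-tokens) are this reviewer's stand-ins for the paper's DSE output, not the authors' data.

```python
# Hedged sketch: rank-correlate candidate regulatory metrics against decode latency (TBT)
# across a set of design points, in the spirit of the Figure 11 analysis. Synthetic data only.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
mem_bw_gbps = rng.uniform(300, 3000, size=200)                 # assumed HBM bandwidth per design point
tpp = rng.uniform(1000, 4800, size=200)                        # assumed TPP-style metric per design point
tbt_ms = 50.0 / (mem_bw_gbps / 1000.0) + rng.normal(0.0, 1.0, size=200)  # decode latency dominated by bandwidth

rho_bw, _ = spearmanr(mem_bw_gbps, tbt_ms)
rho_tpp, _ = spearmanr(tpp, tbt_ms)
print(f"rank correlation with TBT: memory BW {rho_bw:+.2f}, TPP {rho_tpp:+.2f}")
```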
Weaknesses
While the core ideas are strong, the paper's impact could be broadened by addressing a few points. These are less flaws than they are opportunities for extension.
-
The Audience Gap in Policy Recommendation: The paper is written by architects, for architects. However, the ultimate audience for the policy recommendations is policymakers, who may lack the technical background to fully grasp the nuances of systolic array dimensions or L1 cache hierarchies. The paper could be strengthened by including a section that explicitly translates its technical findings into accessible, high-level policy principles or a mock "term sheet" for regulators.
-
The Durability of "Architecture-First" Metrics: The paper convincingly argues that current metrics can be gamed. A potential weakness of the proposed solution is that new, architecturally-aware metrics could also eventually be circumvented in the ongoing "cat-and-mouse" game between regulators and designers. A brief discussion on the resilience of these proposed metrics or a framework for how they might evolve would add depth to the proposal.
-
Scope Limited to LLM Inference: The analysis is tightly focused on LLM inference, which is a reasonable and relevant choice. However, the sanctions are also intended to limit the training of large models. The architectural bottlenecks for training can differ significantly from inference (e.g., greater emphasis on interconnect and FP32/BF16 performance). The paper's conclusions might be more powerful if they briefly discussed how the "architecture-first" approach could be adapted to target training workloads as well.
Questions to Address In Rebuttal
-
Your proposed "architecture-first" policy is compelling. How do you envision the process of translating these technical insights into actionable policy? What steps would be needed to bridge the knowledge gap between computer architects and the regulatory bodies like the Bureau of Industry and Security (BIS)?
-
You argue for using metrics like memory bandwidth and on-chip cache size. Could a motivated adversary not also "game" these metrics? For example, by designing a chip with massive but inefficient caches, or by using novel on-package interconnects that are not captured by a narrow definition of "device memory bandwidth"? How can an architecture-first policy remain robust against such co-option?
-
Your analysis focuses on LLM inference. Could you elaborate on how your framework would apply to LLM training, which is a key concern for regulators? Would the same architectural levers (e.g., L1 cache size) be as effective, or would the policy need to target different features (e.g., inter-chip interconnect bandwidth, specific data format support)?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents an architectural analysis of high-performance computing hardware under the constraints of recent US Advanced Computing Sanctions. The authors use the LLMCompass simulation framework to perform a design space exploration (DSE) of accelerator architectures that comply with the Total Processing Performance (TPP) and Performance Density (PD) limits. Based on this analysis, the authors demonstrate that compliant designs can still be optimized for LLM inference workloads. The paper culminates in proposing an "Architecture-First Policy" framework, which suggests that future regulations should target specific architectural features (e.g., on-chip memory size, memory bandwidth) rather than high-level theoretical performance metrics like TPP to create more effective and targeted controls with fewer negative externalities.
Strengths
-
Novelty of Application Domain: The paper's primary strength is its application of established computer architecture analysis techniques to a novel and highly relevant domain: geopolitical technology sanctions. To my knowledge, this is the first work in a top-tier computer architecture venue to provide a quantitative DSE under these specific regulatory constraints. It successfully bridges the gap between hardware design and public policy.
-
A Well-Articulated Conceptual Framework: The proposed "Architecture-First Policy" (Figure 3, Section 5) is a clear and compelling conceptual contribution. The idea of shifting from reactive, performance-based limits to proactive, architecture-based limits is elegant. It provides a structured way for policymakers and architects to reason about creating targeted regulations.
Weaknesses
My review focuses exclusively on the novelty of the core technical and conceptual contributions, setting aside the timeliness of the topic. While the application is new, the underlying architectural principles and analytical methods are not.
-
Core Architectural Insights are Derivative of Prior Work: The central technical finding of the paper is that LLM inference performance can be selectively targeted by constraining different architectural components. Specifically, the authors show that prefill (TTFT) is compute-bound and can be limited by features like L1 cache size, while decoding (TBT) is memory-bound and can be limited by memory bandwidth (Section 5.3, Figure 11).
This insight is not new. The foundational premise that the prefill phase is compute-bound and the decoding phase is memory-bound is a widely established principle in the LLM inference literature. The very tool the authors use, LLMCompass, was presented in a prior paper [80] that explicitly discusses and models this dichotomy. Other works on LLM serving systems, such as Orca [78] and Megatron-LM [62], are built upon this fundamental understanding. Therefore, the conclusion that restricting compute-proximate resources (like L1 cache) hurts prefill and restricting memory system resources hurts decoding is an expected validation of established knowledge, not a novel discovery. The paper quantifies this effect within the sanction's design space, but the qualitative insight itself is part of the existing art. A back-of-the-envelope arithmetic-intensity argument recovering this dichotomy is given after this list.
-
Conceptual Precedent for "Architecture-First" Limiting Exists: The proposal to create domain-specific hardware limitations by targeting architectural features is not a fundamentally new concept.
-
Industry Precedent: NVIDIA's "Lite Hash Rate" (LHR) technology [75] is a direct commercial precedent. NVIDIA modified its GPU firmware and drivers to specifically detect and throttle Ethereum mining performance while leaving gaming performance largely unaffected. This is a real-world implementation of the exact principle the authors advocate for: architecturally limiting a specific, undesirable workload. The authors acknowledge this in their related work (Section 6.3), but this undermines the novelty of their core policy proposal.
-
Regulatory Precedent: Other technology export control regimes have long used architecture-specific metrics. For instance, the Wassenaar Arrangement [9] and US Export Administration Regulations (EAR) have historically placed controls on cryptographic hardware based on specific architectural details like symmetric key length or the ability to perform certain mathematical operations, rather than a generic "encryption performance" metric.
The authors' contribution is to apply this philosophy to AI accelerators, but the core idea of using fine-grained architectural features as a regulatory lever is not novel in itself.
-
Methodology is Application, Not Invention: The authors correctly and transparently state their use of the LLMCompass framework [80]. The DSE methodology is standard practice in computer architecture research. Therefore, the technical engine of this work is an application of an existing tool and a standard methodology to a new problem. This is a valid engineering study, but it lacks a core methodological or algorithmic novelty.
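For completeness, the established prefill/decode dichotomy referenced in the first weakness follows from a back-of-the-envelope arithmetic-intensity argument. The fp16 weight assumption and symbols below are this reviewer's illustration, not the paper's derivation.

```latex
% Arithmetic intensity (AI) of multiplying L tokens against an fp16 weight matrix W in R^{d x d}:
\[
\mathrm{AI}_{\text{prefill}} \approx \frac{2 L d^{2}\ \text{FLOPs}}{2 d^{2}\ \text{bytes}} = L \ \text{FLOPs/byte},
\qquad
\mathrm{AI}_{\text{decode}} \approx \frac{2 d^{2}}{2 d^{2}} = 1 \ \text{FLOP/byte}.
\]
% With L in the hundreds to thousands, prefill sits well above a GPU-class machine balance point
% (tens to hundreds of FLOPs/byte) and is compute-bound; single-token decode sits far below it and
% is bandwidth-bound. This is the well-known dichotomy the review argues the paper rediscovers.
```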
Questions to Address In Rebuttal
-
The core finding—that limiting L1 cache throttles prefill and limiting memory bandwidth throttles decoding—seems to be a direct and expected consequence of the well-known compute-bound vs. memory-bound nature of these LLM inference phases. Can the authors articulate what new architectural principle was discovered here, beyond confirming and quantifying this known behavior in a new design space?
-
The "Architecture-First Policy" proposal bears a strong conceptual resemblance to NVIDIA's LHR for crypto-mining and historical cryptography export controls based on key length. Could the authors more sharply define the novel delta between their proposed framework and these prior instances of architecture-specific performance regulation?
-
Your proposed policy levers include on-chip SRAM sizing and memory bandwidth. A key aspect of effective regulation is verifiability. How do the authors envision a regulatory body verifying these complex architectural parameters on a packaged chip, especially when marketing materials can be misleading and on-chip resources can be fused off or disabled via firmware? Is this proposal more practical to enforce than the current TPP/PD metrics?
Cambricon-SR: An Accelerator for Neural Scene Representation with Sparse Encoding Table
Abstract
Neural Scene Representation (NSR) is a promising technique for representing real scenes. By learning from dozens of 2D photos captured from different viewpoints, NSR computes the 3D representation of real scenes. However, the performance of NSR processing ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose Cambricon-SR, a co-designed algorithm (ST-NSR) and hardware accelerator for Neural Scene Representation (NSR). The core idea is to introduce sparsity into the hash encoding table to reduce memory accesses and computation. To support this, they propose several hardware units: a Sparse Index Unit (SIU) to filter invalid memory requests, a Sparse Update Unit to manage on-chip table updates, and a dynamic shared buffer for the MLP units. The authors claim a 1259x speedup over an A100 GPU and 4.12x over the prior-art Cambricon-R. While the paper presents a detailed architectural design, I have significant concerns regarding the fundamental evaluation methodology, the justification for key architectural trade-offs, and the validity of several simplifying assumptions.
Strengths
- The core motivation to exploit sparsity in the NSR encoding table is sound. Previous work has identified the encoding stage as a bottleneck, and reducing memory traffic through pruning is a logical approach.
- The paper provides a detailed hardware design, including specific microarchitectural components like the Sparse Index Unit (SIU) and the Sparse Update Unit, which directly address challenges introduced by the sparse algorithm.
- The inclusion of an ablation study (Section 5.2.5, page 13) is commendable, as it attempts to isolate the performance contributions of the proposed architectural features.
Weaknesses
My primary objections to this work center on the validity of its core claims, which I believe are predicated on a flawed evaluation framework and questionable design choices.
-
Fundamentally Misleading Evaluation Methodology: The primary evaluation of modeling quality (Table 1, page 11) is performed at a fixed modeling time of 0.1 seconds. This is not a scientifically rigorous comparison. A faster accelerator will simply complete more training iterations in a fixed time budget. Comparing the quality of Cambricon-SR after N iterations to a GPU after M iterations (where N >> M) does not prove superiority; it merely states the obvious. The only valid comparison for systems with different per-iteration runtimes is time-to-target-quality. The authors must demonstrate the time and energy required for each platform (GPU, Cambricon-R, Cambricon-SR) to reach a predefined PSNR threshold (e.g., 25 dB, 30 dB) on each dataset. Without this, the reported 1259x speedup and the quality improvements shown in Table 1 are unsubstantiated. A sketch of the requested time-to-target-quality comparison is given after this list.
-
Unjustified Architectural Cost and Complexity: The proposed architecture incurs massive overhead to manage sparsity.
- The use of 15 MB of Content Addressable Memory (CAM) is an extreme choice. As the authors note (Section 4.2, page 7), this accounts for 33.56 mm² or 14.29% of the total chip area. CAMs are notoriously power-hungry and do not scale well. The paper provides no justification for why a CAM-based approach was chosen over potentially more efficient hash-based or indexed data structures for address translation.
- The Sparse Index Unit (SIU) is similarly costly. Per Table 2 (page 12), it consumes 8.59% of the area but a disproportionately high 15.09% of the total power. The paper frames this as a worthwhile trade-off, but the energy cost is substantial and requires a more rigorous defense.
-
Critical Algorithmic Approximation Lacks Evidence: In Section 4.1 (page 7), the authors state they use an "imprecise computation of the threshold by using only half of the DT" to speed up the process. They claim this has "negligible impact on the representation accuracy... (less than 0.1)." This is a strong claim with absolutely no supporting data, figures, or ablation study provided in the paper. An approximation in a critical global parameter like the sparsity threshold could have cascading effects on model convergence and final quality. This claim is currently an unsupported assertion.
-
Architectural Regression in Memory Hierarchy: The dataflow introduces a significant step backward compared to the stated prior art. Cambricon-R is described as a "fully fused" architecture that keeps all data on-chip. In contrast, Cambricon-SR requires reading the entire off-chip dense table (DT) for the update stage. Figure 17 (page 13) explicitly shows that Cambricon-SR has substantially more off-chip memory access than Cambricon-R. Increasing DRAM traffic is a major architectural regression. This design trades on-chip access contention (in Cambricon-R) for a massive off-chip memory bottleneck, and it is not clear that this is a net performance win across all scenarios, especially as scenes grow in complexity.
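To clarify what the first weakness is asking for, the following is a minimal sketch of a time-to-target-quality comparison. The convergence curve, per-iteration times, and 25 dB target are illustrative assumptions, not measured values, and the sketch deliberately assumes an identical convergence curve across platforms, which is exactly what the authors would need to demonstrate.

```python
# Hedged sketch of a time-to-target-quality comparison (synthetic convergence curve and
# assumed seconds-per-iteration; none of these numbers come from the paper).
import numpy as np

def time_to_psnr(iter_time_s: float, psnr_curve, target_db: float) -> float:
    """Wall-clock seconds until the per-iteration PSNR curve first reaches target_db."""
    for i, psnr in enumerate(psnr_curve, start=1):
        if psnr >= target_db:
            return i * iter_time_s
    return float("inf")  # target never reached

iters = np.arange(1, 20001)
psnr_curve = 32.0 * (1.0 - np.exp(-iters / 3000.0))   # toy convergence curve shared by all platforms

platforms = {"A100 GPU": 5.0e-3, "Cambricon-R": 4.0e-6, "Cambricon-SR": 1.0e-6}  # assumed s/iteration
for name, t_iter in platforms.items():
    print(f"{name:>13}: {time_to_psnr(t_iter, psnr_curve, target_db=25.0):.3f} s to 25 dB")
```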
Questions to Address In Rebuttal
The authors must address the following points for this paper to be considered for publication:
- Provide a revised evaluation that replaces the fixed-time results in Table 1 with a time-to-target-quality analysis. How long does each platform take to reach a PSNR of 25 dB on the evaluated datasets?
- Provide a dedicated ablation study that quantifies the accuracy impact of using only half the dense table for threshold computation. The claim of "< 0.1" impact must be substantiated with data across all datasets.
- Justify the design choice of using a large, power-hungry CAM for address translation. What alternative mechanisms were considered, and why were they rejected? Present a comparison of the PPA (Power, Performance, Area) trade-offs.
- The design increases off-chip memory traffic compared to Cambricon-R. Please provide a detailed justification for this architectural regression. What is the performance impact of the DRAM bottleneck during the update stage, and how does it limit the overall scalability of the proposed system?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces Cambricon-SR, an algorithm-hardware co-designed accelerator for Neural Scene Representation (NSR) that aims to overcome the performance-quality limitations of previous work. The authors' core insight is that the multi-resolution hash encoding table, a well-known memory bottleneck in modern NSR models like Instant-NGP, is highly compressible and can be made sparse with negligible impact on final rendering quality.
To leverage this insight, they first propose a novel algorithm, Sparse Table NSR (ST-NSR), which dynamically prunes the encoding table during training to achieve over 80% sparsity. They then present a dedicated hardware architecture designed to exploit this algorithm-induced sparsity. The key hardware contributions include: 1) a Sparse Index Unit (SIU) to efficiently filter memory requests to pruned table entries, addressing the challenge of irregular access to the sparsity bitmap; 2) a Sparse Update Unit to manage the dynamic on-chip sparse table efficiently; and 3) a Dynamic Shared Buffer for MLP units, which improves area efficiency and allows for greater parallelism. The co-design results in a system that achieves a remarkable 4.12x speedup over the previous state-of-the-art accelerator (Cambricon-R) while simultaneously improving modeling quality by enabling more training iterations within the same time budget.
Strengths
This is a strong paper with a clear and compelling central thesis. Its primary strengths lie in its holistic approach and its successful targeting of a fundamental bottleneck.
-
Excellent Algorithm-Hardware Co-design: The work is a prime example of successful co-design. The ST-NSR algorithm creates a massive optimization opportunity (sparsity) that is difficult to exploit on general-purpose hardware like GPUs (as shown in their analysis in Section 3.3, page 6). The proposed hardware, particularly the Sparse Index Unit (SIU), is a non-obvious and elegant solution tailored specifically to capitalize on this opportunity. This virtuous cycle, where algorithm and hardware enable each other, is the paper's greatest strength.
-
Addressing the Root Cause, Not the Symptom: Previous work, including the impressive Cambricon-R, focused on managing the massive number of fine-grained, irregular memory accesses to the encoding table. This paper takes a more fundamental approach by aiming to eliminate the majority of those accesses at their source. By identifying and exploiting the inherent sparsity of the scene representation itself, the authors are tackling the root cause of the performance bottleneck, leading to a more profound improvement.
-
Strong Contextual Framing and Motivation: The paper does an excellent job of positioning itself within the existing landscape. The performance-quality trade-off is clearly articulated and visualized in Figure 1 (page 2), which provides a powerful motivation for the work. The authors demonstrate a deep understanding of the limitations of prior art and build a convincing narrative for why their sparsity-based approach is the correct next step.
-
Significant and Well-Validated Impact: The results are outstanding. A 4.12x speedup over a specialized accelerator and a >1000x speedup over a high-end GPU are top-tier results. More importantly, the authors don't just report speedup; they demonstrate that this speedup translates directly into higher modeling quality (Table 1, Figure 13, page 11). This closes the loop and proves that their system genuinely advances the state of the art for practical applications. The thorough ablation study (Figures 18 and 19, page 13) provides strong evidence for the efficacy of each architectural component.
Weaknesses
The paper is very well-executed, and the weaknesses are more avenues for future discussion than critical flaws.
-
Generalizability of the Sparsity Assumption: The entire architecture's effectiveness hinges on the assumption that NSR encoding tables are highly sparse. While the authors validate this across eight diverse datasets, this property is presented as an empirical observation. The work would be strengthened by a brief discussion on the theoretical underpinnings of this sparsity. Is it tied to the surface-to-volume ratio of typical scenes? Are there pathological cases (e.g., volumetric media like clouds, highly detailed fractal geometry) where the table might become dense, and how would the architecture's performance degrade?
-
The Off-Chip Dense Table as a Latent Bottleneck: The design requires maintaining a full dense table (DT) in off-chip DRAM to accumulate gradients and periodically regenerate the on-chip sparse table (ST). As shown in Figure 17 (page 13), this leads to Cambricon-SR having more off-chip memory traffic than Cambricon-R. While the Sparse Update Unit mitigates this, the dependency on this off-chip DT represents a potential scaling limit, especially for extremely long training runs or scenarios requiring more frequent updates.
Questions to Address In Rebuttal
-
The core premise of the paper is the high sparsity observed in the encoding table. Could the authors comment on the sensitivity of Cambricon-SR's performance and efficiency to the sparsity rate? For instance, if a scene required only 50% sparsity instead of >80%, how would the speedup over Cambricon-R be affected? Does the system's advantage degrade gracefully as the table becomes denser?
-
The Sparse Index Unit (SIU) is a critical component for performance, but it also consumes a non-trivial amount of area (8.59%) and power (15.09%), as detailed in Table 2 (page 12). Could the authors elaborate on how the throughput of the SIU was matched to the rest of the pipeline? Is there a risk that for certain access patterns, the SIU itself could become the bottleneck rather than the Sparse Table Array it is designed to protect?
-
The work focuses on accelerating the training/modeling of static scenes. Many future applications in robotics and AR will require modeling dynamic scenes. How do the authors envision the ST-NSR algorithm and the Cambricon-SR architecture adapting to dynamic environments where the sparsity pattern of the encoding table might change radically and continuously? Would the overhead of updating the on-chip ST from the off-chip DT become prohibitive?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper presents Cambricon-SR, a co-designed algorithm and hardware accelerator for Neural Scene Representation (NSR). The core idea is to introduce and exploit sparsity in the hash encoding table, which is a known performance bottleneck in modern NSR algorithms like Instant-NGP.
The authors' claims to novelty can be distilled into three primary contributions:
1. A new training algorithm, ST-NSR, which applies magnitude-based pruning to the NSR encoding table to create a sparse representation.
2. A novel hardware unit, the Sparse Index Unit (SIU), specifically designed to efficiently filter memory requests to the sparse encoding table by transforming the irregular access pattern to a bitmap into a sequential scan-and-match problem.
3. A dynamic shared buffer architecture for the accelerator's MLP units to improve hardware utilization and enable scaling.
The work claims that this co-design results in significant speedups over GPU and a prior-art accelerator (Cambricon-R) by reducing memory traffic and enabling greater parallelism.
Strengths
The primary strength of this paper lies in identifying a new target for a well-known technique and developing a non-trivial hardware solution to manage the resulting challenges.
-
Novel Application of Sparsity: While magnitude-based pruning is a canonical method for model compression, its application directly to the multi-resolution hash encoding table during NSR training appears to be genuinely novel. The authors correctly distinguish their contribution from prior works that focus on sparsity in the sampling stage (e.g., occupancy grids in Instant-NGP [36]) or in the scene representation itself (e.g., sparse voxel fields [28]). This work targets the learned parameter table, which is a different and valid approach to optimization. The claim of proposing the "first NSR algorithm with sparse encoding table" (Section 1, page 3) seems well-supported by the cited literature. A minimal sketch of this style of table pruning is given after this list.
-
The Sparse Index Unit (SIU) Microarchitecture: The problem created by the ST-NSR algorithm—namely, the highly irregular, fine-grained access to a sparsity bitmap—is a difficult one. The proposed SIU (Section 4.4, page 8) is a clever and non-obvious microarchitectural solution. The core idea of converting a massive random-access problem into a series of parallel sequential-scan-and-match operations (Figure 10, page 9) is a significant engineering innovation. It avoids the intractability of a massive crossbar or the bank conflicts of a simple banked SRAM, demonstrating a deep understanding of the hardware design trade-offs.
-
Strong Co-Design Narrative: The work successfully presents a compelling algorithm-hardware co-design story. The algorithm (ST-NSR) creates a new performance opportunity but also a new hardware challenge (irregular bitmap access). A novel hardware unit (SIU) is then proposed to solve that specific challenge. This tight coupling between algorithm and architecture is the hallmark of a strong co-design paper.
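For readers unfamiliar with the technique, the following is a minimal sketch of magnitude-based pruning applied to an encoding table, consistent with the description above. The quantile-based threshold, toy table shape, and 80% sparsity target are assumptions; the exact ST-NSR pruning rule is not reproduced in this review.

```python
# Minimal sketch of magnitude-based pruning of a hash encoding table (illustrative, not ST-NSR).
import numpy as np

def prune_encoding_table(table: np.ndarray, target_sparsity: float = 0.8):
    """Zero out the smallest-magnitude entries; return (sparse_table, validity_bitmap)."""
    magnitudes = np.abs(table).reshape(-1)
    threshold = np.quantile(magnitudes, target_sparsity)   # keep roughly the top (1 - target) fraction
    bitmap = np.abs(table) > threshold                      # True = entry survives in the sparse table
    return table * bitmap, bitmap

table = np.random.randn(2**14, 2).astype(np.float32)        # toy single level of a multi-resolution table
sparse_table, bitmap = prune_encoding_table(table, 0.8)
print(f"kept {bitmap.mean():.1%} of entries")                # roughly 20% survive at 80% sparsity
```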
Weaknesses
The novelty of the contributions is not uniform, and the paper could be more critical in positioning some of its ideas against established concepts in computer architecture.
-
Limited Novelty of the Dynamic Shared Buffer: The third major contribution, the "dynamic shared buffer for the MLP units" (Section 4.5, page 10), is presented as a novel proposal. However, the core concepts are not new. Buffer sharing among parallel processing units to improve utilization is a standard technique in accelerator design. Similarly, dynamic memory management based on tensor liveness (releasing activation memory after its last use in the backward pass) is a foundational optimization in deep learning compilers and runtimes. While its application here is well-executed and the design space exploration is thorough (Figure 12, page 11), it represents an application of existing principles rather than the introduction of a new one. This contribution is more of an engineering optimization than a conceptual breakthrough.
-
Complexity of the Proposed Solution: The introduction of sparsity necessitates a cascade of complex hardware: the CAMs for address translation in the Sparse Table Array (Section 4.2, page 7), the intricate logic of the Sparse Update Unit (Section 4.3, page 8), and the sophisticated SIU itself. While the performance gains are substantial, the resulting hardware is significantly more complex than the baseline architecture. The paper demonstrates that the benefits outweigh the costs, but the degree of novelty must be weighed against this increase in design complexity. The core innovation (sparse tables) necessitates a large amount of non-trivial, but perhaps less-novel, supporting engineering.
Questions to Address In Rebuttal
-
On the "First Algorithm" Claim: The authors claim ST-NSR is the "first NSR training algorithm with sparse encoding tables" (Section 1, page 2). While the paper differentiates this from sampling-level sparsity, can the authors elaborate on any other prior art that applies pruning directly to parameter tables in similar hash-based feature grid models, even if outside the specific domain of NeRF/NSR? Defending this claim more broadly would strengthen the paper's primary contribution.
-
On the Novelty of the Buffer Architecture: The concept of dynamic shared buffers (Section 4.5, page 10) is a well-established technique for improving hardware utilization. Could the authors clarify the specific novel aspects of their management scheme beyond its application to this particular accelerator, and contrast it with memory management strategies used in other data-parallel accelerators or deep learning frameworks?
-
On Alternatives to the Sparse Index Unit: The SIU is an impressive but complex design. Have the authors considered alternative, simpler mechanisms for filtering invalid accesses? For example, a probabilistic data structure like a Bloom filter could potentially filter a majority of invalid requests with a much lower area and complexity footprint, at the cost of allowing a small fraction of invalid requests to pass through to the NoC. What is the justification for the chosen deterministic, but highly complex, SIU design over such alternatives?
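To make the alternative raised in the last question concrete, here is a minimal sketch of a Bloom-filter-style pre-filter in front of the sparse table. The bit-array size, hash count, and use of blake2b are illustrative software choices, not a proposed hardware design; a real SIU replacement would need hardware-friendly hash functions and a bound on false-positive traffic reaching the NoC.

```python
# Sketch of a probabilistic pre-filter: "definitely pruned" requests are dropped locally,
# while possible hits (including rare false positives) are forwarded to the sparse table.
import hashlib

class BloomFilter:
    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key: int):
        for i in range(self.num_hashes):
            h = hashlib.blake2b(f"{key}:{i}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "little") % self.num_bits

    def add(self, key: int):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def may_contain(self, key: int) -> bool:   # False => definitely pruned; True => forward request
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

bf = BloomFilter()
for idx in range(0, 100_000, 5):               # pretend 20% of table indices survived pruning
    bf.add(idx)
forwarded = sum(bf.may_contain(q) for q in range(100_000))
print(f"requests forwarded past the filter: {forwarded} of 100000")
```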
Neoscope: How Resilient Is My SoC to Workload Churn?
Abstract
The lifetime of hardware is increasing, but the lifetime of software is not. This leads to devices that, while performant when released, have fall-off due to changing workload suitability. To ensure that performance is maintained, computer architects must ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present Neoscope, an ILP-based design space exploration tool intended to find System-on-Chip (SoC) configurations that are resilient to "workload churn"—the evolution of software over a device's lifetime. They introduce a quantitative framework to describe churn using two metrics, Magnitude (ε) and Disruption (Δ), and categorize churn into four types. The primary contribution is a method that, unlike prior work (HILP), claims to find a near-optimal SoC for a given workload in a single solver invocation by inverting the JSSP formulation. The paper then uses this tool to explore the design space under various churn scenarios and for different optimization objectives (performance, energy, cost, etc.).
However, the practical utility of the tool and the validity of its conclusions are severely undermined by a series of significant, and in some cases unjustified, abstractions and simplifications in both the hardware and workload modeling. The "optimality" claimed is only with respect to a model that deviates substantially from real-world systems, casting doubt on the real-world applicability of the derived insights.
Strengths
- Problem Formulation: The paper correctly identifies a critical and timely problem. As hardware lifetimes extend, designing for future, unknown workloads is an increasingly important challenge for computer architects. The attempt to formalize this problem is commendable.
- Churn Metrics: The introduction of Magnitude (ε) and Disruption (Δ) (Section 3.2, page 4) provides a simple and intuitive, if basic, vocabulary for quantifying and discussing workload churn. This framework offers a useful conceptual lens. One plausible instantiation of these metrics is sketched after this list.
- ILP Formulation Extension: The technical extension to the HILP model using "exclusive machine ranges" (Section 3.4, page 5) to allow the solver to select hardware configurations is a clever methodological contribution. This inversion of the problem is the core technical novelty of the tool itself.
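For concreteness, the sketch below shows one plausible instantiation of the Magnitude/Disruption vocabulary over per-AIO execution-time vectors. The paper's exact Section 3.2 definitions are not reproduced here, so the formulas and example numbers are this reviewer's assumptions.

```python
# Hedged sketch: Magnitude (epsilon) as relative change in total work, Disruption (delta) as the
# shift in the normalized work mix. Illustrative formulas and values, not the paper's definitions.
import numpy as np

def churn_metrics(aio_time_y0: np.ndarray, aio_time_yN: np.ndarray):
    total0, totalN = aio_time_y0.sum(), aio_time_yN.sum()
    magnitude = (totalN - total0) / total0                   # epsilon: how much total work changed
    mix0, mixN = aio_time_y0 / total0, aio_time_yN / totalN
    disruption = 0.5 * np.abs(mixN - mix0).sum()             # delta: how much the operation mix re-balanced
    return magnitude, disruption

year0 = np.array([10.0, 5.0, 2.0, 1.0])      # per-AIO execution time at launch (illustrative)
year7 = np.array([12.0, 15.0, 2.0, 6.0])     # the same AIOs after years of churn (illustrative)
eps, delta = churn_metrics(year0, year7)
print(f"Magnitude = {eps:+.2f}, Disruption = {delta:.2f}")
```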
Weaknesses
- Critically Oversimplified Hardware Models: The credibility of the entire study hinges on the fidelity of its models, which are demonstrably insufficient.
- The model for Dedicated Special Accelerators (DSAs) as having a fixed "4x efficiency advantage over a GPU with the same area" (Section 4, page 7) is arbitrary and lacks empirical grounding. The paper's central trade-off between general-purpose GPUs and specialized DSAs is entirely dependent on this single, hardcoded number. The conclusions are therefore not generalizable and may be artifacts of this specific assumption.
- The decision to omit Dynamic Voltage and Frequency Scaling (DVFS) from the model (Section 3, Implementation, page 6) is a fatal flaw for any analysis claiming to optimize for energy, power, or cost efficiency. Modern SoCs rely fundamentally on DVFS to manage these resources. Conclusions about the energy efficiency of "medium-size SoCs" (Abstract) or the layout of "carbon-optimal SoCs" (Section 5.3) are rendered speculative at best, as they ignore the primary mechanism for energy management in real hardware.
- Artificial Workload Manipulation: The experimental setup relies on unrealistic modifications to the workload benchmarks.
- The authors state they "artificially squash all setup and teardown phase times by 20x" (Section 4, page 7) to prevent CPU bottlenecks. This is a significant intervention that fundamentally alters the nature of the applications, biasing the problem to be almost entirely about accelerator performance. This calls into question any conclusions about the required number of CPU cores or the overall SoC composition, as a primary source of system-level contention has been artificially removed.
- The churn models themselves are synthetic. While based on real-world growth rates, the mechanisms—such as uniformly multiplying AIO counts or swapping the two most expensive AIOs (Section 4, page 8)—do not capture the true nature of software evolution. Real churn often involves the introduction of entirely new algorithms and computational patterns (i.e., new AIO types), which this framework is incapable of modeling. The study is therefore limited to resilience against re-balancing of existing work, not the emergence of new work.
- Unjustified Claims of Optimality: The paper repeatedly uses terms like "globally optimal" and "near-optimal" (Abstract, Section 3.3). This is misleading. The tool finds an optimal solution for an abstracted, simplified model. Given the points above, there is a significant risk that a configuration deemed optimal by Neoscope would be far from optimal when implemented in a real system with DVFS, realistic CPU overheads, and more complex DSA/GPU trade-offs. The conclusions are presented with a degree of certainty that the underlying model does not support.
Questions to Address In Rebuttal
- Please provide a rigorous justification for the "4x efficiency advantage" of a DSA over a GPU. Crucially, provide a sensitivity analysis showing how the key findings (e.g., the SoC compositions in Figure 8) change if this factor is varied (e.g., to 2x or 8x).
- How can the paper's conclusions regarding energy, cost, and carbon efficiency (Sections 5.3, 5.4, 6) be considered valid in the complete absence of DVFS modeling? Please defend this methodological choice and explain why it does not invalidate your findings.
- Justify the decision to reduce CPU-bound phase times by a factor of 20x. How would the optimal SoC configurations change if this artificial reduction was removed or lessened (e.g., to 2x)?
- The study explores churn by re-weighting existing AIOs. How would your framework and conclusions be affected by a more realistic churn scenario where entirely new AIOs (incompatible with existing DSAs) are introduced over the SoC's lifetime? Please address this fundamental limitation of your churn model.
- Given the significant abstractions in the model, do you agree that all claims of "optimality" should be strictly qualified as "optimal with respect to the proposed abstract model" rather than implying real-world optimality?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces the concept of "workload churn" to characterize the evolution of software over the increasing lifespan of hardware. It makes a significant conceptual contribution by formalizing churn along two axes: Magnitude (change in total work) and Disruption (change in the mix of operations). Based on this framework, the authors present Neoscope, a multi-objective pre-silicon design space exploration (DSE) tool. Neoscope's core technical innovation is its use of Integer Linear Programming (ILP) and job-shop scheduling concepts, not just to find an optimal schedule for a given System-on-Chip (SoC), but to co-optimize the schedule and the SoC configuration itself within a single invocation. Building on prior work like HILP [32], it "inverts" the problem to find the ideal hardware for a given workload trajectory. The authors use Neoscope to explore the resilience of SoCs to different churn profiles (Minimal, Perturbing, Escalating, Volatile) and derive several key architectural insights regarding specialization, over-provisioning, and the impact of optimizing for metrics like energy, cost, and carbon footprint.
Strengths
-
Excellent Problem Formulation and Timeliness: The single most important contribution of this paper is the clear and quantitative framing of the "workload churn" problem. The concepts of Magnitude and Disruption provide a much-needed vocabulary and analytical framework for a challenge that is central to modern computer architecture. As hardware replacement cycles lengthen and software evolves at an accelerating pace, designing for lifetime performance consistency rather than just launch-day peak performance becomes paramount. This work is perfectly positioned at the intersection of these trends.
-
Novel and Powerful Methodology: The technical approach of Neoscope is both clever and impactful. By extending the ILP model from HILP [32] to include hardware selection via "exclusive machine ranges" (Section 3.4, page 5), the authors elegantly sidestep the need for massive, brute-force parameter sweeps that plague traditional DSE. This makes a comprehensive exploration of resilience not just theoretically possible, but computationally tractable. This is a significant step forward for early-stage architectural modeling.
-
Rich and Insightful Architectural Analysis: The paper does not merely present a tool; it effectively uses the tool to uncover non-obvious design principles. The findings—that small SoCs are particularly vulnerable to high-Disruption churn, that over-provisioning is beneficial for high-Magnitude churn (Section 5.1, page 9), and that medium-sized SoCs strike the best balance for energy efficiency (Section 5.3, page 10)—are valuable heuristics for architects. The analysis of carbon optimality (Section 6, page 11), which suggests that large amounts of "dark silicon" in the form of specialized DSAs can be optimal, is particularly counter-intuitive and thought-provoking. This demonstrates that the tool can genuinely guide architects toward better, more sustainable designs.
-
Excellent Contextualization: The work is well-situated within the existing literature. It clearly identifies its intellectual lineage from models like MultiAmdahl [48] and HILP [32] and articulates its value proposition against more detailed simulators like Aladdin [35]. The integration of carbon modeling concepts from FOCAL [10] shows a commendable awareness of adjacent, important research areas.
Weaknesses
While the core ideas are strong, the work rests on several abstractions whose implications could be explored more deeply.
-
Fidelity of the Churn Model: The modeling of churn by adjusting AIO [33] counts is a necessary and effective abstraction for the ILP framework. However, true workload evolution can be more complex than simply changing the frequency of existing operations. A fundamental algorithmic shift (e.g., the move from CNNs to Transformers in AI) might introduce entirely new types of operations that are not in the initial AIO set. The current model captures this as a high-Disruption event but doesn't fully account for the introduction of fundamentally new work primitives.
-
Simplicity of the Hardware Model: The paper models DSAs as having a fixed "4x efficiency advantage over a GPU with the same area" (Section 4, page 7). This is a reasonable simplification for an early-stage tool, but the true benefits of specialization are highly dependent on the specific task. The strength of the paper's conclusions, particularly those regarding the dominance of DSAs in carbon-optimal designs, may be sensitive to this assumption. A toy sensitivity sketch illustrating this concern is given after this list.
-
ILP Scalability Concerns: The authors are transparent about the significant computational cost (up to 10,000 CPU hours, Section 4, page 7), but a more formal discussion on the scalability limits of the ILP formulation would be beneficial. How does the solver complexity scale with the number of IP block options, AIO types, or churn time-steps? Understanding these boundaries is crucial for assessing the tool's applicability to even more complex, future SoCs.
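The sensitivity concern above can be illustrated with a toy Amdahl-style model. The specialization factors, coverage fractions, and the assumption that work the DSA cannot run falls back to the rest of the SoC at baseline speed are all this reviewer's assumptions, not values or a model from the paper.

```python
# Toy sensitivity sketch for the fixed "4x DSA over GPU" assumption: how much a DSA helps as a
# function of its efficiency advantage k and the fraction c of the (possibly churned) workload
# it still covers. Illustrative numbers only.
def dsa_speedup(k: float, c: float) -> float:
    """Amdahl-style speedup when fraction c runs k-times faster on the DSA and the
    remaining (1 - c) runs at baseline speed elsewhere on the SoC."""
    return 1.0 / ((1.0 - c) + c / k)

for k in (2.0, 4.0, 8.0):                 # swept specialization factor (the paper fixes this at 4x)
    for c in (0.95, 0.7, 0.4):            # coverage erodes as churn introduces work the DSA cannot run
        print(f"k={k:.0f}x, coverage={c:.0%}: speedup {dsa_speedup(k, c):.2f}x")
```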
Questions to Address In Rebuttal
-
Could the authors elaborate on how the Neoscope framework might handle a "Volatile" churn scenario where entirely new AIOs (not present at year 0) are introduced at year 5? Does the model assume a static, known universe of all possible AIOs for the device's lifetime, or could it be adapted for such unforeseen changes?
-
The conclusion that dark silicon (in the form of DSAs) is carbon-optimal is fascinating. How sensitive is this finding to the 4x DSA efficiency assumption? For example, if the advantage were only 2x, at what point on the operational-vs-embodied carbon spectrum (the α_E2O sweep in Figure 10) does the more general-purpose GPU become the more carbon-optimal choice?
-
The paper focuses on pre-silicon DSE. Given its powerful modeling capabilities, could the authors speculate on how the Neoscope methodology could be applied in a post-silicon context? For instance, could it be used to guide strategic software updates for an existing device fleet to maximize performance on aging hardware, or to inform resource management on FPGAs in the field?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper presents Neoscope, a design space exploration (DSE) framework for System-on-Chips (SoCs) intended to be resilient to "workload churn." The authors' claimed contributions are twofold. First, they propose a conceptual framework to quantify workload churn using two metrics: Magnitude (ε), representing the change in total work, and Disruption (Δ), representing the change in the operational mix. Second, they present a multi-objective optimization tool, based on Integer Linear Programming (ILP), that extends a prior framework (HILP [32]) to synthesize a near-optimal SoC configuration in a single invocation, rather than requiring an iterative parameter sweep over fixed SoC designs. The core technical mechanism for this is an ILP formulation that allows the solver to select hardware components from a library. The paper then uses this tool to explore the design space under various churn scenarios and for different optimization objectives (performance, cost, carbon, etc.).
Strengths
The primary novel contribution of this work is methodological, residing in the formulation of the SoC design problem. The authors correctly identify a key limitation in their direct predecessor, HILP [32]: HILP can find an optimal schedule for a workload on a fixed SoC, but finding the optimal SoC itself requires an external, brute-force search loop.
The core innovation in Neoscope is the extension of the HILP model to absorb this outer search loop into the ILP formulation itself. This is achieved by introducing what the authors call "exclusive machine ranges" (Section 3.4, page 5). By modeling a library of IP blocks (e.g., a GPU with 1, 2, 4, or 8 SMs) as a set of mutually exclusive "machines" in the Job-Shop Scheduling Problem (JSSP) formulation, the solver is forced to choose exactly one configuration for each component type as part of the optimization. This transforms the problem from "optimal scheduling on a fixed design" to "simultaneous optimal design selection and scheduling." This is a significant and elegant step forward from the prior art they cite. While ILP-based synthesis is not new in itself, its application to invert the DSE paradigm at this level of architectural abstraction is a genuinely novel approach within this specific context.
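For clarity, the sketch below shows the standard "pick exactly one" ILP pattern that, as this reviewer understands it, the "exclusive machine ranges" mechanism instantiates. The notation is the reviewer's, not the paper's.

```latex
% For each IP block type t with candidate configurations c in C_t (e.g., a GPU with 1/2/4/8 SMs),
% introduce binary selection variables x_{t,c} and gate the JSSP assignment variables y_{j,t,c}
% (job j scheduled on configuration c of block t) on the selected configuration:
\begin{align*}
  \sum_{c \in C_t} x_{t,c} &= 1         && \forall t        && \text{(exactly one configuration per block type)}\\
  y_{j,t,c} &\le x_{t,c}                && \forall j, t, c  && \text{(jobs may only use the chosen configuration)}\\
  x_{t,c} \in \{0,1\}, \quad y_{j,t,c} &\in \{0,1\}
\end{align*}
```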
The second, softer contribution is the formalization of "churn" via the ε and Δ metrics (Section 3.2, page 4). While the concept of workload evolution is not new, providing a simple, quantitative framework to describe it is valuable. It provides a clear and useful vocabulary for the problem domain.
Weaknesses
My main critique concerns the framing of the novelty. While the application of the ILP technique is novel, the underlying technique itself is not.
-
The Core ILP Mechanism is Standard: The concept of "exclusive machine ranges" is a standard modeling technique in mathematical programming, often implemented using Special Ordered Sets (SOS1) or simple binary variable constraints to enforce a "pick one from N" choice. The authors should explicitly acknowledge that the novelty is not in the invention of this ILP constraint style, but in its specific application to augment the JSSP model for SoC synthesis. The current framing could be misinterpreted as the invention of a new class of constraint.
-
Overlaps with High-Level Synthesis (HLS): The problem of simultaneously performing resource allocation and scheduling is the central challenge of High-Level Synthesis. For decades, HLS research has employed ILP-based methods to solve this exact problem, albeit at a lower level of abstraction (i.e., synthesizing RTL from C/C++). For instance, ILP formulations have been used to determine the optimal number of functional units (adders, multipliers) and to schedule operations onto them to meet latency or area constraints. While Neoscope operates at a much coarser, architectural level (CPUs, GPUs, DSAs vs. ALUs), the fundamental problem structure is conceptually identical. The paper is missing a discussion of this vast body of prior art and fails to clearly differentiate its contribution from it. The novelty is in the level of abstraction, not in the fundamental approach of using ILP for co-optimization of resources and schedule.
-
Breadth of the "First Tool" Claim: The abstract claims Neoscope is the "first multi-objective pre-silicon design space exploration tool for investigating... architectures that are resilient to workload churn." This claim is too strong. DSE tools are numerous, and many are multi-objective. The unique, novel element is the single-invocation synthesis method based on the HILP extension. The novelty claim should be sharpened to focus on this specific mechanism, as it is the paper's most defensible and significant contribution.
Questions to Address In Rebuttal
The authors should use the rebuttal to clarify the precise boundaries of their novel contributions.
-
Can the authors please confirm that the core "exclusive machine ranges" technique is a standard ILP modeling pattern and re-state their contribution as the novel application of this pattern to the architectural DSE problem defined in HILP [32]?
-
Please elaborate on the relationship between Neoscope and the body of work on ILP-based High-Level Synthesis. How is your problem formulation fundamentally different from prior HLS work that also co-optimizes resource allocation (i.e., component selection) and scheduling, beyond the obvious difference in architectural abstraction level (SoC blocks vs. functional units)?
-
The proposed churn framework (ε and Δ) is based entirely on aggregate execution times per AIO type. Does this abstraction hide other critical aspects of churn, such as changes in memory access patterns, communication-to-computation ratios, or control flow divergence, which might render a synthesized SoC suboptimal despite ε and Δ remaining stable? Why is this execution-time-centric view sufficient?
CORD: Low-Latency, Bandwidth-Efficient and Scalable Release Consistency via Directory Ordering
Abstract
Increasingly, multi-processing unit (PU) systems (e.g., CPU-GPU, multi-CPU, multi-GPU, etc.) are embracing cache-coherent shared memory to facilitate inter-PU communication. The coherence protocols in these systems support write-through accesses that ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present CORD, a cache coherence protocol that enforces release consistency for write-through accesses by ordering them at the LLC directory rather than at the source processor. The stated goal is to eliminate acknowledgment messages required by source-ordering protocols, thereby reducing latency and interconnect traffic. The core mechanisms involve decoupling sequence numbers into epoch numbers and store counters for single-directory ordering, and an inter-directory notification system for multi-directory scalability.
While the fundamental premise of offloading ordering to the directory is plausible for simple communication patterns, this paper's central claims of low-latency and scalability are not rigorously substantiated. The proposed mechanisms appear to merely shift the performance bottleneck from a processor stall awaiting acknowledgments to a directory stall awaiting cross-directory notifications, and the protocol's performance is shown to be highly sensitive to communication fan-out, contradicting the claim of scalability.
Strengths
- Problem Motivation: The paper effectively identifies and quantifies the performance and traffic overheads associated with acknowledgment messages in source-ordered (SO) write-through coherence protocols (Section 3.1, Figure 2). This provides a clear justification for exploring alternatives.
- Core Single-Directory Mechanism: The use of decoupled epoch numbers and store counters (Section 4.1) is a technically interesting approach to managing ordering metadata for write-throughs, aiming to reduce traffic overhead for frequent Relaxed stores.
- Evaluation Breadth: The experimental setup compares CORD against relevant baselines (SO, MP, WB) across two different interconnect technologies (CXL, UPI) and includes a sensitivity analysis of key application parameters.
Weaknesses
-
Misleading Latency Claims: The paper claims that CORD "eliminates processor stall" (Section 5, Figure 5 caption). This is misleading. While the source processor may not stall waiting for an acknowledgment, the critical Release operation is effectively stalled at the destination directory until all notifications from pending directories are received (Section 4.2, p. 6). The latency bottleneck is simply relocated, not eliminated. The critical path for a Release store is now dependent on a potentially high-fanout broadcast/gather operation between directories, which is not an improvement in all cases.
-
Unsubstantiated Claim of Scalability: The title and abstract prominently feature "Scalable" as a key contribution. However, the inter-directory notification mechanism has a worst-case control message complexity of 2n-1 for a Release store with n-1 pending directories. The authors' own sensitivity analysis (Section 5.3, Figure 8, right) directly contradicts the scalability claim, showing that CORD's performance benefit over SO rapidly diminishes and its overhead compared to MP significantly increases as communication fan-out grows. The protocol appears to be performant only under the assumption of low fan-out, which may not hold for future complex applications.
-
Insufficient Analysis of Overflow and Stalling: The proposed solution for handling the overflow of metadata look-up tables (e.g., unacknowledged epochs, store counters) is to stall the processor (Section 4.3, p. 7). The authors assert that "such worst-case scenarios are extremely rare" without providing sufficient evidence. A robust protocol design must be proven correct and performant even in adversarial, worst-case scenarios, not just for a set of "well-behaved" benchmarks. The potential performance impact of these stalls is not quantified.
-
Reliance on Favorable Workload Characteristics: The strong performance results appear to be heavily dependent on the chosen benchmarks having low-to-moderate communication fan-out and coarse-grained synchronization (Section 5.2, p. 9). The paper admits that for workloads with high fan-out (TRNS, MOCFE), CORD's performance advantage shrinks or reverses. This suggests the results are not generalizable and that CORD is a point solution for a specific class of applications rather than a broadly applicable, scalable protocol.
Questions to Address In Rebuttal
-
Please provide a detailed critical path analysis for a Release store in CORD with a fan-out of N directories. Quantify the stall time at the destination directory as a function of N and inter-directory latency, and directly compare this to the processor stall time it replaces in a source-ordered protocol.
-
Given that the sensitivity analysis in Figure 8 shows performance benefits decreasing from >60% to ~25% as fan-out increases from 1 to 7 hosts, how do the authors justify the claim that the protocol is "Scalable"? Please define the specific conditions under which CORD is expected to be more performant than the baseline source-ordering protocol.
-
The strategy for bounding storage is to stall. Can the authors provide data from a synthetic, adversarial benchmark designed to maximize metadata table pressure (e.g., high frequency of Release stores with minimal intervening Relaxed stores, combined with high network latency to delay completions)? What is the measured performance degradation due to these stalls?
-
In Section 5.4 (p. 12), the sub-linear scaling of network buffers and look-up tables at the directory is justified by stating that the number of recycled Release stores "scales sub-linearly". This appears to be circular reasoning. What is the fundamental architectural reason for this sub-linear scaling, rather than an artifact of the specific MPI alltoall workload?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents CORD, a novel cache coherence protocol designed to optimize the performance of Release Consistency (RC) in modern multi-processing unit (PU) systems. The authors identify a key inefficiency in current systems: the enforcement of memory ordering for write-through operations at the source processor ("source ordering"). This approach necessitates acknowledgment messages from the last-level cache (LLC) directory back to the source, incurring significant latency, traffic, and energy overheads, particularly in emerging AI/ML and HPC workloads that heavily utilize producer-consumer patterns.
The core contribution of CORD is to shift the responsibility of ordering these write-through operations from the source processor to the destination LLC directory ("directory ordering"). This eliminates the need for the performance-degrading acknowledgment messages. To achieve this efficiently and scalably, CORD introduces two key mechanisms: (1) a decoupled system of epoch numbers and store counters to track dependencies with minimal metadata and traffic overhead, and (2) a novel inter-directory notification mechanism that allows directories to coordinate directly, ensuring correct ordering across a distributed LLC without involving the source processor. The authors demonstrate through simulation that CORD significantly improves performance and reduces interconnect traffic compared to traditional source ordering, while offering a much simpler programming model than manually orchestrated message passing.
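As one plausible reading of the decoupled mechanism described above, the sketch below shows a single directory that applies Relaxed stores tagged only with an epoch number and holds a Release, which carries the full store count for its epoch, until that many Relaxed stores have arrived. The class and method names, and the exact visibility condition, are my assumptions rather than the paper's definitions.

```python
"""Sketch of single-directory ordering with decoupled epoch numbers and
store counters (one plausible reading; details are assumptions)."""
from collections import defaultdict

class Directory:
    def __init__(self):
        # (source, epoch) -> number of Relaxed stores received so far
        self.received = defaultdict(int)
        # Releases waiting for their epoch's Relaxed stores to arrive
        self.pending_releases = []

    def relaxed_store(self, src, epoch, addr, value):
        # Relaxed stores carry only the small epoch tag; apply immediately.
        self.received[(src, epoch)] += 1
        self._drain()

    def release_store(self, src, epoch, store_count, addr, value):
        # The Release carries the full counter: how many Relaxed stores of
        # this epoch must be visible before the Release may be ordered.
        self.pending_releases.append((src, epoch, store_count, addr, value))
        self._drain()

    def _drain(self):
        still_waiting = []
        for rel in self.pending_releases:
            src, epoch, count, addr, value = rel
            if self.received[(src, epoch)] >= count:
                print(f"Release from {src} (epoch {epoch}) ordered at directory")
            else:
                still_waiting.append(rel)
        self.pending_releases = still_waiting

d = Directory()
d.release_store("PU0", epoch=3, store_count=2, addr=0x40, value=1)  # arrives early
d.relaxed_store("PU0", epoch=3, addr=0x10, value=7)
d.relaxed_store("PU0", epoch=3, addr=0x20, value=9)                 # Release now ordered
```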
Strengths
The primary strength of this paper is its elegant and timely core idea. It addresses a real and growing performance bottleneck in the very systems that are becoming central to modern computing.
-
Clear Problem Identification and Motivation: The paper does an excellent job of identifying a specific, impactful problem in modern coherence protocols (e.g., AMBA CHI, CXL) as detailed in Section 3.1. The analysis in Figure 2 (page 3), which quantifies the overhead of acknowledgment messages, provides a compelling motivation for the work. The authors correctly position this problem in the context of heterogeneous computing and the prevalence of write-through policies for inter-PU communication.
-
Elegant and Fundamental Contribution: The central concept of moving the ordering point from the source processor to the destination directory is a fundamental shift in protocol design. It is a simple idea to state, but one with profound implications for performance. It directly attacks the root cause of the identified bottleneck—the round-trip communication for ordering—rather than attempting to mitigate its effects. This is the hallmark of strong systems research.
-
Pragmatic and Scalable Design: The authors demonstrate a deep understanding of the practical challenges. The decoupled epoch/store counter mechanism (Section 4.1, page 4) is a clever solution to the trade-off between metadata overhead and handling overflows. More importantly, the inter-directory notification mechanism (Section 4.2, page 5) shows that the authors are not designing for a simplistic, single-directory model but are tackling the complexity of modern, distributed, and sliced LLCs. This makes the proposal far more relevant and credible for future many-core and disaggregated systems.
-
Excellent Contextual Positioning: The paper effectively situates CORD in the design space between traditional cache-coherent shared memory and message passing. The discussion in Section 3.2, including the ISA2 litmus test example (Figure 3, page 4), clearly articulates why naive message passing fails to provide the system-wide guarantees of RC, highlighting the value of CORD in achieving message-passing-like efficiency without sacrificing the familiar and simpler shared-memory programming model. This framing makes the contribution's significance immediately apparent.
Weaknesses
The paper is strong, and its weaknesses are more about exploring the boundaries and interactions of the proposed idea rather than fundamental flaws.
-
Interactions with Other Memory Operations: The paper's focus is squarely on optimizing write-through operations under RC. Section 4.4 (page 7) briefly discusses interactions with write-back stores, loads, and dependencies. However, this section feels somewhat cursory. In real-world, complex applications, the interplay between CORD's directory-ordered write-throughs and traditional source-ordered write-backs, coherent reads, and I/O could be complex. The proposed solution of injecting barriers seems potentially heavy-handed and might negate some of CORD's benefits in workloads with a more balanced mix of memory traffic. A more thorough exploration of these interactions would strengthen the paper.
-
Generality Beyond Release Consistency: The epoch number and store counter mechanism is beautifully tailored to the semantics of Release Consistency, which distinguishes between Relaxed and Release operations. The evaluation under TSO in Section 6 (page 12) is insightful, showing that while CORD still provides a performance benefit, it incurs a traffic overhead compared to source ordering. This suggests that the elegance and efficiency of CORD are closely tied to the semantics of a weak memory model. The work would be even more impactful with a discussion on the fundamental principles that could be generalized to other, stronger memory models, or an explicit acknowledgment of this limitation.
-
Potential for New Bottlenecks: The inter-directory notification mechanism is a key part of the scalable design. However, it introduces a new communication pattern between directories. In scenarios with very high fan-out (a single core writing to data homed at many different directories before a Release) and fine-grained synchronization, it is conceivable that the "destination directory" of the Release store could become a new bottleneck, waiting on notifications from many peers. While the evaluation in Figure 8 (page 10) touches on fan-out, a deeper qualitative analysis of potential secondary effects and hotspotting at the directories would be valuable.
Questions to Address In Rebuttal
-
Regarding the interaction with write-back stores (Section 4.4), could you provide more detail on the frequency of injecting "additional directory-ordered Release barriers"? In a workload with a significant mix of write-through producer-consumer traffic and write-back traffic with locality, how much of CORD's performance gain might be eroded by these additional synchronization events?
-
The inter-directory notification mechanism is clever, but could it create contention on the network interfaces of the directory controllers themselves, especially in pathological "all-to-one" synchronization patterns? Does the design inherently avoid this, or is it a potential scalability concern for extreme workloads?
-
Your TSO evaluation (Section 6) is very interesting. It suggests that ordering all stores at the directory, as required by TSO, increases traffic because acknowledgements are still needed to confirm total ordering. Does this imply that the true "sweet spot" for directory ordering is exclusively for weak memory models like RC, where only a subset of operations (Releases) serve as ordering points that can be managed without acknowledgements for the operations preceding them?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present CORD, a cache coherence protocol designed to efficiently enforce release consistency (RC) in multi-PU systems that utilize write-through memory accesses. The central claim of novelty is the proposal to shift the point of ordering for these write-through operations from the source processor to the destination LLC directory. This "directory ordering" approach is intended to eliminate the performance and traffic overhead of acknowledgment messages required by conventional "source ordering" schemes (e.g., as implemented in AMBA CHI and CXL).
To realize this, the authors introduce two primary mechanisms:
1. A decoupled system of "epoch numbers" (for Release stores) and "store counters" (for Relaxed stores) to manage ordering at a single directory with minimal traffic overhead.
2. A novel "inter-directory notification" mechanism where directories communicate directly with one another to enforce ordering for operations that span multiple directory slices, thus enabling scalability without involving the source processor in the coordination.
The paper argues that this approach achieves the performance efficiency of message-passing systems while preserving the simpler, system-wide programming model of cache-coherent shared memory.
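The sketch below illustrates one simplified reading of that notification gather: the home directory slice of a Release polls its peer slices and orders the Release only after every peer has confirmed its pending stores. The ReqNotify/Notify names come from the paper; the synchronous, event-free structure is my simplification.

```python
"""Simplified sketch of the inter-directory notification gather for a
Release whose preceding Relaxed stores are homed at several LLC slices."""

class DirectorySlice:
    def __init__(self, name):
        self.name = name
        self.pending_relaxed = 0      # Relaxed stores not yet ordered locally

    def handle_req_notify(self):
        # Reply with Notify once all locally pending Relaxed stores are
        # ordered (assumed already true here to keep the sketch event-free).
        return self.pending_relaxed == 0

def order_release(home, peers):
    # The Release's home slice sends ReqNotify to each peer slice and makes
    # the Release visible only after every Notify arrives: 2*(n-1) control
    # messages plus the Release itself.
    notifies = [peer.handle_req_notify() for peer in peers]
    if all(notifies):
        print(f"{home.name}: Release ordered after "
              f"{2 * len(peers)} ReqNotify/Notify messages")
    else:
        print(f"{home.name}: Release held, waiting on peer slices")

slices = [DirectorySlice(f"dir{i}") for i in range(4)]
order_release(home=slices[0], peers=slices[1:])
```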
Strengths
From a novelty perspective, the paper's strengths are:
-
Clear Articulation of a Novel Architectural Approach: The core concept of "directory ordering" for write-throughs within a hardware cache coherence protocol that enforces system-wide release consistency appears to be a genuinely novel contribution. While destination-ordering exists in other domains (e.g., message passing), its application and formalization within a scalable, multi-directory hardware coherence framework for RC is not a trivial adaptation and represents a new design point.
-
Novel Scalability Mechanism: The inter-directory notification mechanism (Section 4.2, page 5) is the most significant novel element. In conventional protocols, scaling to multiple directories while maintaining ordering typically requires the source processor to act as a serialization point, collecting acknowledgments from all involved parties. CORD's approach of offloading this coordination to the directories themselves is a clever and previously unexplored method. It effectively creates a distributed mechanism to resolve a global ordering dependency.
-
Specific, Novel Optimization Technique: The decoupled epoch/store counter system (Section 4.1, page 4) is a novel microarchitectural technique tailored for the problem. Generic sequence numbers are not new, but splitting them to specifically match the semantics of Relaxed vs. Release stores—embedding the full counter only in infrequent Release messages—is an elegant optimization that directly addresses the traffic overhead trade-off. This demonstrates a deep consideration of the problem rather than a simple application of a known technique.
Weaknesses
The assessment of novelty must also consider conceptual precedents and the significance of the "delta" over prior art:
-
Conceptual Precedent in Other Domains: The fundamental idea of ordering at the destination rather than the source is the core operating principle of posted writes in message-passing interconnects like PCIe. The authors correctly identify that these interconnects only provide point-to-point ordering guarantees, which are insufficient for system-wide RC (as shown with the ISA2 litmus test in Section 3.2, page 3). However, this means CORD's novelty is not the invention of destination-ordering but rather its synthesis into a protocol that can enforce system-wide consistency. The paper should be careful not to overstate the fundamental novelty of the ordering location itself.
-
Echoes of Distributed Systems Concepts: The proposed mechanisms bear a resemblance to established concepts in distributed systems. The epoch/counter system is functionally similar to logical clock schemes used to establish causal ordering. The inter-directory notification is a form of distributed coordination. While the application in a low-latency hardware coherence protocol is novel, the work would be stronger if it acknowledged this conceptual lineage and more clearly distinguished how the hardware constraints and specific RC semantics lead to a fundamentally different solution than what is found in software-based distributed systems literature.
-
Complexity vs. Benefit Justification: The proposed mechanisms, particularly for multi-directory ordering, introduce non-trivial complexity. New hardware structures are needed at both the processor and directory (Figure 6, page 7), and new message types (ReqNotify, Notify) are added to the protocol. While the evaluation shows a clear benefit over source ordering, the novelty of this added complexity must be weighed against its gains. For workloads with high communication fan-out, the 2n-1 control message overhead (Figure 5, page 6) is a significant architectural cost. The novelty is therefore a new trade-off point, not a universally superior solution without cost.
Questions to Address In Rebuttal
-
The core idea of shifting the ordering point from the source to the destination has clear conceptual parallels in message-passing systems. Can the authors more precisely articulate the novel architectural challenges that arise when applying this concept to a cache-coherent, system-wide RC model that makes CORD a non-trivial or non-obvious extension of prior ideas?
-
The inter-directory notification mechanism is presented as a key contribution for scalability. Has any prior work in hierarchical or distributed directory coherence protocols proposed mechanisms for direct directory-to-directory communication to resolve ordering or forward requests, even if not for this specific purpose of write-through RC? Please contrast CORD's notification scheme with any such prior art.
-
The decoupled epoch and store counter mechanism (Section 4.1, page 4) is a specific implementation choice to reduce traffic. How does this technique compare to other sequence-based ordering mechanisms proposed in the literature for cache coherence or memory ordering (e.g., within processors or in other protocols)? Is the novelty the decoupling itself, or its application to directory-side RC enforcement?
Nyx: Virtualizing dataflow execution on shared FPGA platforms
Abstract
As FPGAs become more widespread for improving computing performance within cloud infrastructure, researchers aim to equip them with virtualization features to enable resource sharing in both temporal and spatial domains, thereby improving hardware ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present Nyx, a system designed to enable task-pipelined dataflow execution within a virtualized, multi-tenant FPGA environment. The core mechanism involves "virtual FIFOs" (vFIFOs) that use off-chip DRAM to buffer data between dependent tasks, allowing producer tasks to execute and retire even if consumer tasks are not yet scheduled. The paper claims this is the first system to support such a model in this context and demonstrates significant performance improvements over a baseline task-parallel execution model.
While the problem is well-defined and the proposed mechanism is plausible, the evaluation relies on a potentially weak baseline and fails to adequately quantify critical performance and resource overheads. The reported performance gains appear optimistic, and the scalability of the approach is questionable given the high resource cost presented. The claims of superiority are not yet substantiated with sufficient rigor.
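For concreteness, the sketch below shows the generic pattern the vFIFO mechanism appears to instantiate: a hypervisor-owned descriptor over a circular buffer in off-chip memory, into which a producer can stream (and then retire) before any consumer is configured. The field names, byte-level flow control, and capacity are illustrative assumptions, not Nyx's actual design.

```python
"""Generic sketch of a DRAM-backed virtual FIFO (vFIFO): a hypervisor-owned
descriptor plus a circular buffer in off-chip memory."""

class VirtualFifo:
    def __init__(self, capacity_bytes):
        self.dram = bytearray(capacity_bytes)   # stands in for an off-chip region
        self.capacity = capacity_bytes
        self.head = 0                           # consumer read pointer
        self.tail = 0                           # producer write pointer
        self.used = 0

    def push(self, data: bytes) -> bool:
        # Producer can stream and retire even if no consumer is configured yet.
        if self.used + len(data) > self.capacity:
            return False                        # back-pressure the producer
        for b in data:
            self.dram[self.tail] = b
            self.tail = (self.tail + 1) % self.capacity
        self.used += len(data)
        return True

    def pop(self, nbytes: int) -> bytes:
        # Consumer drains whenever it is eventually scheduled on the fabric.
        nbytes = min(nbytes, self.used)
        out = bytearray()
        for _ in range(nbytes):
            out.append(self.dram[self.head])
            self.head = (self.head + 1) % self.capacity
        self.used -= nbytes
        return bytes(out)

vfifo = VirtualFifo(capacity_bytes=1 << 20)
vfifo.push(b"intermediate tile from producer task")   # producer may now retire
print(vfifo.pop(12))                                   # consumer drains later
```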
Strengths
- Clear Problem Identification: The paper correctly identifies a significant limitation in existing FPGA virtualization frameworks—the lack of support for dataflow execution (task pipelining), which is a primary advantage of FPGA-based acceleration. The distinction between task-parallel and task-pipelined models is well-articulated in Section 2.1 and Figure 1.
- Coherent Architectural Concept: The core idea of using vFIFOs to decouple producer and consumer tasks in time is a logical approach to enabling pipelining in a dynamically scheduled environment. The architecture described in Section 3 appears technically sound for implementing this concept.
- Evaluation Across Schedulers: The testing of the Nyx architecture against five different scheduling policies (Section 4.1, page 7) is a strong point, as it demonstrates that the claimed benefits are fundamental to the architecture itself and not an artifact of a specific scheduling algorithm.
Weaknesses
- Questionable Baseline Comparison: The baseline architecture is a custom implementation meant to emulate prior work [28, 33, 36, 46]. The authors state they "stripped off any operating system features that could introduce performance overheads" (Section 4.1, page 7). This raises concerns of a strawman argument. Systems like Coyote [28] integrate OS features as a fundamental part of their design and trade-offs. By removing them, the baseline may not be a fair or accurate representation of the state-of-the-art, potentially inflating Nyx's relative performance.
- Unquantified vFIFO Overhead: The entire system hinges on the performance of vFIFOs, which use off-chip DRAM for communication. Traditional, high-performance dataflow relies on extremely low-latency on-chip FIFOs (BRAMs) or direct streaming connections. The paper provides no microbenchmarks or analysis quantifying the latency and throughput penalty of a vFIFO transfer compared to an on-chip transfer. The performance gains are shown relative to a task-parallel model which also uses DRAM for intermediate storage, but this obscures the true performance cost of virtualizing these communication channels compared to a non-virtualized static dataflow implementation.
- Understated and Prohibitive Resource Costs: The implementation cost reported in Table 3 (page 10) is a major concern that is not sufficiently addressed. The base Nyx configuration consumes nearly 29% of the device's BRAMs. More alarmingly, the configuration required for the Optical Flow benchmark consumes 45.4% of all available BRAMs. This is a massive cost for the static shell, severely limiting the resources available for user logic and questioning the scalability of the approach. What happens with applications requiring more than 8 vFIFOs or more complex communication patterns? Framing this as merely "a new direction for future research" (page 11) is an understatement of a fundamental system limitation.
- Potentially Misleading Performance Claims: The standalone benchmark results in Figure 6 (page 8), which report up to an 8.87x speedup, are conducted under a "non-sharing" policy. This is an artificial scenario that does not reflect the target use case of a shared, multi-tenant system. While useful for isolating architectural benefits, these dramatic numbers should not be highlighted as representative of real-world shared performance. The nearly 9x improvement for Image Compression, in particular, lacks a convincing explanation beyond generic statements about "overlapping operations" and seems like an outlier.
- Limited Scope of Benchmarks: The benchmark suite (Table 2, page 7) appears to be carefully selected. Digit Recognition, with its two highly imbalanced tasks, conveniently shows that Nyx does no harm in cases unsuitable for pipelining. However, the evaluation lacks applications with more complex dataflow graphs (e.g., multiple fork-join stages, cyclic dependencies, or graphs requiring more vFIFOs than available regions). The robustness and generality of the Nyx approach are therefore not fully demonstrated.
Questions to Address In Rebuttal
- Please provide a stronger justification for your baseline architecture. How can you assure the reviewers that your "stripped-down" task-parallel model is a fair and high-performance representation of the systems you cite [28, 33, 36], rather than a strawman?
- Provide microbenchmark data comparing the latency and achievable throughput of a single data transfer through a vFIFO (DRAM) versus a standard on-chip BRAM-based FIFO on your target platform. This is essential to understanding the performance trade-offs of your virtualization approach.
- The BRAM overhead of your infrastructure is exceptionally high, reaching 45.4% for one benchmark. Please address the scalability of this approach. How would Nyx support an application that requires 16 concurrent communication channels? Is the vFIFO architecture fundamentally limited by on-chip memory resources?
- Please provide a detailed, cycle-level or architectural explanation for the 8.87x performance improvement observed for the Image Compression benchmark in Figure 6. Why does the task-parallel baseline perform so poorly on this specific application compared to others?
- How would the Nyx architecture and vFIFO management scheme handle dataflow graphs with cyclic dependencies, which are common in some domains (e.g., iterative algorithms)? Your current DAG-based model (Section 3.1) seems to preclude this.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents Nyx, a system designed to bridge a critical gap between two established paradigms in FPGA computing: the high-performance, single-application dataflow execution model and the high-utilization, multi-tenant resource sharing model common in cloud environments. The authors correctly identify that existing FPGA virtualization frameworks primarily support a task-parallel model, which introduces significant latency and stalls between dependent tasks, thereby negating one of the key architectural advantages of FPGAs.
The core contribution of Nyx is to enable true dataflow (task-pipelined) execution in a shared, virtualized environment. It achieves this through a novel abstraction called "virtual FIFOs"—independent, dynamically allocated communication channels that use off-chip DRAM as a backing store. This mechanism elegantly decouples producer and consumer tasks, allowing them to execute without being simultaneously scheduled and masking the latency of partial reconfiguration. The paper provides a comprehensive evaluation showing that Nyx significantly outperforms state-of-the-art task-parallel approaches in terms of response time, tail latency, and deadline violations across various workloads and system demands.
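A back-of-envelope model helps separate the two execution styles being compared. The two-stage pipeline, the full-overlap assumption, and the example numbers below are mine and serve only to illustrate the qualitative argument, not to reproduce the paper's measurements.

```python
"""Back-of-envelope comparison of task-parallel vs. task-pipelined execution
for a two-stage producer/consumer application (all numbers illustrative)."""

def task_parallel(reconf1, exec1, reconf2, exec2):
    # Producer runs to completion and stages its output in DRAM; only then
    # is the consumer configured and run, so nothing overlaps.
    return reconf1 + exec1 + reconf2 + exec2

def task_pipelined(reconf1, exec1, reconf2, exec2):
    # Idealized bound: the vFIFO lets the consumer be configured and consume
    # while the producer streams, so total time approaches the producer's
    # setup plus the longer of the two overlapped stages.
    return reconf1 + max(exec1, reconf2 + exec2)

stages = dict(reconf1=5.0, exec1=20.0, reconf2=5.0, exec2=18.0)  # ms, made up
print("task-parallel :", task_parallel(**stages))   # 48.0
print("task-pipelined:", task_pipelined(**stages))  # 28.0
```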
Strengths
-
Clear Identification of a Significant Problem: The paper's greatest strength is its clear articulation of a fundamental conflict in modern FPGA deployment. The authors correctly diagnose that as FPGAs move into shared cloud infrastructure, the virtualization models being developed are forcing applications into an inefficient, non-native execution model. The problem framing, especially with Figure 1 (page 3) and Table 1 (page 2), is exceptionally effective and immediately establishes the novelty and importance of the work.
-
Elegant and Well-Justified Core Abstraction: The concept of "virtual FIFOs" is a simple but powerful solution to the problem. In a dynamic, multi-tenant environment, statically linking tasks with on-chip resources is impossible. By using off-chip DRAM to buffer data, Nyx creates a flexible communication fabric that decouples the spatial and temporal scheduling of dependent tasks. This is the key enabling technology that makes virtualized dataflow possible, and it is a very insightful piece of systems design.
-
Strong Contextualization and Positioning: The authors have done an excellent job of positioning their work within the broader landscape of FPGA systems research. The Related Work section (Section 5, page 11) and the introductory comparison table (Table 1, page 2) clearly delineate how Nyx builds upon prior work in FPGA virtualization (e.g., Coyote, Nimblock) while introducing a fundamentally new capability that these systems lack. This demonstrates a mature understanding of the field.
-
Robust and Convincing Evaluation: The experimental methodology is thorough. The authors compare Nyx against a well-defined Baseline architecture that represents the current state of the art. The decision to evaluate across three distinct congestion scenarios ("Relaxed," "Standard," "Stressed") provides a comprehensive picture of the system's behavior under different real-world conditions. The results consistently demonstrate substantial improvements, lending strong credibility to the paper's claims.
Weaknesses
My critiques are less about flaws in the work and more about opportunities to further explore the implications and trade-offs of the proposed design.
-
Under-explored Performance Implications of DRAM-backed FIFOs: The primary trade-off of the virtual FIFO approach is the reliance on off-chip DRAM, which has significantly higher latency and lower bandwidth than on-chip BRAMs. While the paper quantifies the resource cost (Section 4.5, page 10), it does not deeply analyze the potential performance bottleneck. For applications with extremely high data rates between pipelined stages, could the DRAM bandwidth become the limiting factor, potentially eroding the gains from overlapping execution? A discussion on the break-even point or the characteristics of applications that would be sensitive to this trade-off would strengthen the paper.
-
Scalability with Respect to Complex Application DAGs: The paper acknowledges that fork/join patterns (as seen in Optical Flow) increase implementation costs. This hints at a potential weakness in scalability for more complex application graphs. The current model seems to excel at linear pipelines. How does the virtual FIFO management system and the FPGA hypervisor handle tasks that fan-out to many consumers or require synchronization from many producers? This scenario could introduce significant management overhead or contention on the virtual FIFO channels.
-
Missed Opportunity to Discuss Dataflow-Aware Scheduling: The paper demonstrates that Nyx can enhance the performance of existing scheduling policies (FCFS, SJF, Nimblock). However, the introduction of a dataflow execution model opens the door for entirely new, dataflow-aware scheduling policies. For example, a scheduler could prioritize co-locating producer-consumer pairs to minimize DRAM traffic or make decisions based on the "fullness" of virtual FIFOs. While developing such a scheduler is likely out of scope, a discussion of these future possibilities would highlight the broader impact of Nyx as an enabler for further research.
Questions to Address In Rebuttal
-
Could the authors comment on the potential for the off-chip DRAM bandwidth to become a performance bottleneck for certain classes of applications? Are there scenarios where the latency of accessing the virtual FIFO in DRAM would negate the benefits of hiding the reconfiguration latency?
-
How does the proposed hypervisor and virtual FIFO management scale to applications with complex fork/join topologies beyond the one mentioned? Is there a risk of control-plane overhead or resource contention becoming a limiting factor as the number of inter-task dependencies grows?
-
The paper demonstrates compatibility with existing schedulers. Have the authors considered the potential for novel scheduling policies that are explicitly dataflow-aware (e.g., policies that try to co-schedule producer/consumer pairs or prioritize tasks based on data availability)? Does the Nyx architecture provide the necessary hooks or information to the scheduler to enable such policies?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present Nyx, a system designed to enable dataflow (task-pipelined) execution within a virtualized, multi-tenant, partially reconfigurable FPGA environment. The central claim is that this is the first system to bridge the gap between high-performance, statically allocated dataflow models and flexible, task-parallel virtualization models. The core mechanism proposed is the "virtual FIFO" (vFIFO), a hypervisor-managed channel that uses off-chip DRAM to buffer data between producer and consumer tasks. This mechanism makes tasks agnostic to the scheduling state of their counterparts, allowing a producer to execute and stream data even if the consumer is not yet configured on the FPGA fabric. The authors demonstrate through extensive evaluation that this approach significantly reduces application response times, tail latency, and deadline violations compared to state-of-the-art task-parallel virtualization frameworks.
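The decoupling claim can be summarized by the lifecycle sketched below: the hypervisor owns a bounded pool of channels whose contents outlive the producer task and predate the consumer task. The API shape and the pool limit are hypothetical and only mirror the resource-bounded behavior discussed later in this review.

```python
"""Hypothetical hypervisor-side view of vFIFO channels: channel contents
outlive the producer task and predate the consumer task, and the number of
channels is bounded by the statically provisioned vFIFO controllers."""

class Hypervisor:
    def __init__(self, num_vfifo_controllers=8):
        self.free_slots = num_vfifo_controllers
        self.channels = {}                      # channel id -> DRAM stand-in

    def create_channel(self, cid):
        if self.free_slots == 0:
            raise RuntimeError("no vFIFO controller available")  # static limit
        self.free_slots -= 1
        self.channels[cid] = bytearray()

    def producer_write(self, cid, data):
        self.channels[cid] += data              # survives producer eviction

    def consumer_read(self, cid, nbytes):
        buf = self.channels[cid]
        out, self.channels[cid] = bytes(buf[:nbytes]), buf[nbytes:]
        return out

hv = Hypervisor()
hv.create_channel("producer->consumer")
hv.producer_write("producer->consumer", b"partial results")  # producer retires,
                                                             # its region is freed
print(hv.consumer_read("producer->consumer", 7))             # consumer runs later
```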
Strengths
The primary strength of this paper is its identification and solution to a genuine and important gap in the literature. While FPGA virtualization is a well-trodden field, prior work has indeed focused predominantly on task-parallel execution models, thereby sacrificing one of the FPGA's key architectural advantages: deep pipelining.
The core technical contribution, the vFIFO abstraction, is a clean and effective mechanism to solve the identified problem. By virtualizing the communication channel and leveraging DRAM for elasticity, the system successfully decouples the execution of dependent tasks in time and space, which is the fundamental enabler for pipelining in a dynamic environment. The authors' claim to be the "first" to integrate these concepts in a task-based shared FPGA environment appears to hold up against a survey of prominent prior art in the field (e.g., Coyote [28], Nimblock [33], AmorphOS [26]). These systems focus on scheduling and resource management for independent tasks, not on facilitating high-throughput, pipelined communication between them.
Weaknesses
My analysis focuses exclusively on the novelty of the core idea. While the specific synthesis of a hypervisor, partial reconfiguration, and DRAM-backed FIFOs for multi-tenant dataflow is new, the paper could strengthen its novelty claim by more thoroughly differentiating its core mechanism from conceptually adjacent prior art.
-
Conceptual Overlap with Latency-Insensitive Channels: The concept of channels that buffer data to handle timing variations between modules is not entirely new. The LEAP FPGA operating system [22] introduced "latency-insensitive communication channels" to facilitate communication between hardware modules. While LEAP's context was not multi-tenant virtualization in the same vein as Nyx, the fundamental idea of a buffered communication primitive bears resemblance. The authors' novelty claim rests on the vFIFO being managed by a hypervisor in a dynamic, multi-tenant context, which allows for producer/consumer tasks to be completely absent from the fabric. This is a crucial distinction, but it should be made more explicitly in the paper, perhaps with a direct comparison in the Related Work section (Section 5). The current dismissal of FPGA OSes as simply treating hardware tasks like software processes is too general and misses this specific technical overlap.
-
The "Virtual FIFO" is a well-known pattern: At its core, the vFIFO is a software-managed, DRAM-backed circular buffer. This pattern is a foundational concept in computer science for producer-consumer problems. The novelty here is not the data structure itself, but its application and integration into an FPGA virtualization layer. The paper presents it as a novel architectural component, which is true in this context, but it's important to frame the novelty as the system integration and management, not the underlying mechanism of a memory-based queue.
-
Complexity vs. Benefit Justification: The proposed solution introduces non-trivial resource overhead. As shown in Table 3 (page 10), the Nyx infrastructure consumes significantly more BRAM resources than the Baseline architecture (28.6% vs 9.4%, and a striking 45.4% for the fork/join variant). This is a substantial static cost for the virtualization infrastructure. While the performance benefits are impressive (Figure 7 shows >2x improvements), the paper frames this purely as a win. A more nuanced discussion on this trade-off is warranted. Is there a point where the BRAM cost of the static infrastructure limits the number or size of user kernels that can be deployed, potentially negating the benefits for certain workload mixes? The novelty is clear, but its practical viability hinges on this trade-off, which could be explored more deeply.
Questions to Address In Rebuttal
-
Please explicitly clarify the novelty of the vFIFO mechanism in comparison to the latency-insensitive channels proposed in prior work such as LEAP [22]. What specific features of the vFIFO (e.g., hypervisor control, handling of non-resident tasks) are fundamentally different and enable multi-tenant dataflow in a way prior channel-based designs could not?
-
The resource overhead of the Nyx infrastructure, particularly the BRAM usage for the vFIFO controllers and crossbars, is significant. Could you discuss the scalability of this approach? How does the resource cost grow as the number of reconfigurable regions and, consequently, the number of potential parallel virtual channels increases?
-
The paper focuses on a single-FPGA system. Does the vFIFO abstraction naturally extend to a multi-FPGA environment, where a producer task on FPGA-1 could stream to a consumer task on FPGA-2? If so, is the core mechanism novel in that context as well, or does it begin to overlap with existing work on data streaming in distributed FPGA systems?
HPVM-HDC: A Heterogeneous Programming System for Accelerating Hyperdimensional Computing
Abstract
Hyperdimensional Computing (HDC), a technique inspired by cognitive models of computation, has been proposed as an efficient and robust alternative basis for machine learning. HDC programs are often manually written in low-level and target specific ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present HPVM-HDC, a programming system comprising a C++-based language (HDC++) and a compiler aimed at simplifying the development of Hyperdimensional Computing (HDC) applications for heterogeneous targets, including CPUs, GPUs, and custom HDC accelerators. The system introduces high-level primitives for HDC operations and claims to generate performance-competitive code across these targets from a single source.
While the goal of a unified programming model for this emerging domain is laudable, the work suffers from significant methodological weaknesses in its evaluation and appears to overstate the capabilities of its compiler, particularly concerning the targeting of custom accelerators. The central claims of performance superiority and effortless retargetability are not substantiated by the evidence provided.
Strengths
- Problem Identification: The paper correctly identifies a critical challenge in the HDC ecosystem: the lack of a portable, high-level programming model that can target a diverse and growing set of hardware, from general-purpose processors to specialized ASICs.
- System Implementation: The authors have clearly implemented a non-trivial system, including a language frontend, an intermediate representation, and backends for multiple, distinct architectures. The effort to target both a digital ASIC and a ReRAM-based accelerator is noted.
- Approximation Primitives: The introduction of primitives for managing accuracy-performance trade-offs, such as red_perf (Section 4.2), is a conceptually sound approach for an error-resilient domain like HDC; a minimal sketch of the underlying idea follows this list.
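In the general sense, reduction perforation skips a fraction of the elements feeding a reduction and rescales the partial result; the exact semantics of red_perf may differ, so the NumPy fragment below is a conceptual illustration only.

```python
import numpy as np

def perforated_similarity(query, class_hvs, skip=4):
    """Approximate similarity search: sample every `skip`-th dimension of the
    hypervectors and rescale, trading accuracy for fewer reduction steps.
    (Conceptual illustration of reduction perforation, not red_perf itself.)"""
    q = query[::skip]
    c = class_hvs[:, ::skip]
    scores = (c @ q) * skip          # rescale the perforated dot products
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
classes = rng.choice([-1, 1], size=(10, 10_000))         # 10 class hypervectors
query = classes[3] * rng.choice([1, 1, 1, -1], 10_000)   # noisy copy of class 3
print(perforated_similarity(query, classes, skip=4))      # usually prints 3
```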
Weaknesses
-
Exaggerated Compiler Capabilities for Accelerators: The core premise of a retargetable compiler is undermined by the implementation described for the HDC accelerators. The paper states that HPVM-HDC lowers encoding_loop, training_loop, and inference_loop primitives directly to the accelerators' high-level functional interfaces (Section 4.3, page 8). This is not compilation in a traditional sense; it is a library call wrapper. The system does not appear to synthesize accelerator code from fine-grained HDC primitives (e.g., matmul, hamming_distance). Consequently, the programmer is required to manually structure their code using these coarse-grained "loop" primitives to enable accelerator execution. This negates the claim of a truly hardware-agnostic programming model, as the programmer must be aware of the accelerator's coarse-grained operational model to leverage it (see the sketch after this list).
Unconvincing Performance Claims on GPUs: The headline claim of a 1.17x geomean speedup over "optimized baseline CUDA implementations" is not rigorously defended. The authors themselves concede in Section 5.2 (page 10) that "This speedup is primarily the result of different tuning choices in HPVM-HDC generated code compared to the baseline codes." This is a critical flaw. A compiler evaluation must demonstrate that the compiler itself produces superior code from a similar or identical high-level abstraction. Attributing performance gains to manual tuning choices made by the paper's authors, which could have been applied to the baseline, invalidates the comparison. The evaluation is not measuring the compiler's efficacy but rather the authors' tuning effort. The 5% slowdown on HyperOMS further suggests the compiler's code generation is not universally superior.
-
Misleading Accelerator Performance Evaluation: The evaluation in Section 5.2 and Figure 6 is fundamentally flawed. The authors compare "device-only" performance of the ASIC/ReRAM accelerators against a Jetson Orin GPU. This metric is misleading as it omits the cost of data transfer between the host and the accelerator. The paper explicitly states (Section 5.2, page 10) that "the digital ASIC and its ARM host CPU only communicate at approximately 10 kbps." At such a low bandwidth, any non-trivial application would be overwhelmingly bottlenecked by data movement, rendering the "device-only" speedup practically meaningless. A valid evaluation must present end-to-end wall-clock time for the entire application task. The current presentation hides a fatal system bottleneck.
-
Weak Baselines and Metrics:
- The performance comparison against Python/NumPy on the CPU is acknowledged by the authors as weak but is still presented prominently in Figure 5. Comparing a compiled AOT language (C++) to an interpreted one (Python) is uninformative and inflates the perceived benefits of the system.
- Lines of Code (LOC), presented in Section 5.4 and Table 4, is a notoriously poor metric for programmability. Comparing C++ LOC to Python LOC is meaningless due to the languages' inherent verbosity differences. While the comparison to CUDA C++ is more relevant, the reduction is modest and does not, on its own, prove a significant improvement in developer productivity.
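To illustrate the first weakness above: if lowering the inference-loop primitive to an accelerator amounts to emitting a call to the device's high-level API, the "backend" collapses to a wrapper of the kind sketched below. The driver class is hypothetical; only the execute_inference entry point is taken from the interface the paper describes.

```python
"""Hypothetical illustration of 'lowering as a library-call wrapper': the
coarse-grained inference primitive maps one-to-one onto a device API call,
while the fine-grained CPU path is where actual code generation happens."""
import numpy as np

class HdcAcceleratorDriver:
    # Hypothetical host-side driver; only the execute_inference entry point
    # mirrors the coarse-grained interface described in the paper.
    def execute_inference(self, encoded, class_hvs):
        return int(np.argmax(class_hvs @ encoded))   # stand-in for the device

def lower_inference_loop(target, encoded, class_hvs, driver=None):
    if target == "hdc_asic":
        # "Compilation" reduces to forwarding the whole stage to the device.
        return driver.execute_inference(encoded, class_hvs)
    # Fine-grained path: similarity scores plus argmax that the compiler
    # would actually have to generate code for on a CPU or GPU.
    return int(np.argmax(class_hvs @ encoded))

rng = np.random.default_rng(0)
encoded = np.sign(rng.standard_normal(10_000))
class_hvs = np.sign(rng.standard_normal((5, 10_000)))
print(lower_inference_loop("hdc_asic", encoded, class_hvs,
                           driver=HdcAcceleratorDriver()))
```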
Questions to Address In Rebuttal
-
Regarding accelerator targeting: Can the HPVM-HDC compiler automatically identify and map a standard, fine-grained loop implementing inference (e.g., a C++ for-loop containing matrix multiplication and an argmin) to the accelerator's execute_inference function? Or must the programmer explicitly replace their fine-grained code with the __hetero_hdc_inference_loop intrinsic for the accelerator to be used? If the latter, how is this functionally different from a preprocessor macro or a library API call?
Regarding GPU performance: Please provide a controlled experiment that isolates the performance gains attributable solely to the HPVM-HDC compiler's automatic code generation versus those from manual "tuning choices" that differ from the baseline implementations. For a fair comparison, the same algorithmic structure and tuning parameters should be applied to both the baseline and the HDC++ version.
-
Regarding accelerator evaluation: Please provide an end-to-end performance comparison for HD-Classification on the Digital ASIC, including all necessary data transfers over the 10 kbps host-device link for the Isolet dataset. How does the total wall-clock time compare to the Jetson Orin in this realistic scenario?
-
Regarding the approximation optimizations: The analysis in Section 5.3 is confined to a single application (HD-Classification). How can we be confident that these trade-offs, particularly the observation that encoding is less amenable to perforation than similarity search, generalize to other HDC workloads like those presented in Table 2?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces HPVM-HDC, a complete, end-to-end programming system designed to address the programmability and portability challenges in the field of Hyperdimensional Computing (HDC). The authors propose a C++ based language, HDC++, which provides high-level, domain-specific primitives for HDC operations. This language is coupled with a retargetable compiler, HPVM-HDC (an extension of the existing HPVM framework), that can compile a single HDC++ program to a diverse set of hardware targets: multi-core CPUs, GPUs, a taped-out digital HDC ASIC, and a simulated ReRAM-based accelerator. The system also incorporates HDC-specific approximation optimizations, such as automatic binarization and reduction perforation, allowing developers to explore the performance-accuracy trade-off space inherent to HDC. The core contribution is not an algorithmic advance in HDC itself, but rather the creation of a critical piece of infrastructure—the first unified software ecosystem that bridges the gap between high-level HDC algorithm development and efficient execution on heterogeneous, specialized hardware.
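For readers outside the HDC community, the operations these primitives cover reduce to a short encode/train/infer pipeline. The NumPy sketch below uses generic random-projection HDC and is not HDC++ code; the dimensionality and data are arbitrary.

```python
import numpy as np

D = 10_000                                   # hypervector dimensionality
rng = np.random.default_rng(0)

def encode(x, proj):
    """Random-projection encoding followed by sign binarization."""
    return np.sign(proj @ x)

def train(samples, labels, proj, num_classes):
    """Bundle (sum) the encodings of each class into a class hypervector."""
    class_hvs = np.zeros((num_classes, D))
    for x, y in zip(samples, labels):
        class_hvs[y] += encode(x, proj)
    return class_hvs

def infer(x, proj, class_hvs):
    """Similarity search: pick the class hypervector closest to the encoding."""
    return int(np.argmax(class_hvs @ encode(x, proj)))

features = 64
proj = rng.standard_normal((D, features))
xs = rng.standard_normal((20, features))
ys = np.arange(20) % 2
class_hvs = train(xs, ys, proj, num_classes=2)
print(infer(xs[0], proj, class_hvs))         # typically recovers label 0
```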
Strengths
-
High Significance as Enabling Infrastructure: The most significant contribution of this work is its role as an enabler for the entire HDC research community. The success of any alternative computing paradigm, especially one that relies on custom accelerators, is critically dependent on a viable software stack. This work is to HDC what systems like TensorFlow/PyTorch and compilers like TVM/MLIR are to the deep learning ecosystem. By providing a unified path from a single source code to CPUs, GPUs, and custom hardware, HPVM-HDC dramatically lowers the barrier to entry for both algorithm researchers wanting to test on real hardware and hardware architects looking to demonstrate the value of their custom designs. This is a foundational contribution that could significantly accelerate the maturation of the HDC field.
-
Excellent Scope and Completeness: The system is impressively comprehensive for a first paper. It is not just a language proposal or a single-target compiler. The authors have delivered an end-to-end solution that considers:
- A high-level programming model (HDC++) that captures the essential operations of HDC (Section 3, page 5).
- A retargetable compiler intermediate representation (HPVM-HDC IR) that can express both HDC primitives and generic parallelism (Section 4.1, page 5).
- Backends for a truly heterogeneous set of targets, spanning the spectrum from general-purpose (CPU/GPU) to highly specialized (ASIC/ReRAM) hardware (Section 4.3, page 8). The demonstration on a real, taped-out ASIC is particularly compelling.
- Domain-specific optimizations that exploit the error-resilient nature of HDC, which is one of its key selling points (Section 4.2, page 7). The evaluation of these trade-offs in Figure 7 (page 11) is well-executed.
-
Contextualization and Vision: The paper is well-positioned within the broader landscape. The authors correctly identify the "programmability wall" as the primary obstacle to the adoption of HDC accelerators. They build upon a known, robust compiler framework (HPVM, Section 2.3, page 4) and clearly differentiate their work from existing HDC libraries (which lack accelerator support) and the only other known HDC compiler, HDCC (which is CPU-only and less flexible), as discussed in the Related Work section (Section 6, page 12). The work embodies a forward-looking vision for hardware-software co-design in the post-Moore era.
Weaknesses
-
Modest Performance Gains on General-Purpose Hardware: While the performance is claimed to be "competitive," the 1.17x geomean speedup over optimized baseline CUDA implementations (Figure 5, page 9) is not, in itself, a groundbreaking performance result. However, this is a minor weakness in the context of this paper's primary goals. The key achievement is attaining this level of performance while providing portability and improved programmability. The value proposition is the combination of all three, not just raw speed on one platform. The authors should be careful to frame this result as achieving performance-parity with hand-tuned code, which is a major success for a retargetable compiler.
-
Limited Generality of Accelerator Backend: The two custom accelerators targeted (the digital ASIC and ReRAM simulator) appear to expose similar, coarse-grained, high-level interfaces (e.g., monolithic execute_inference calls, as shown in Listing 6, page 8). It is not yet clear how well the HPVM-HDC compilation approach would generalize to future HDC accelerators that might expose finer-grained, more diverse hardware primitives. The current approach of mapping high-level inference_loop primitives directly to accelerator APIs is effective here but may not be a universally applicable strategy.
Ad-Hoc Nature of HDC++: The HDC++ language is a pragmatic extension of C++ using special types and compiler intrinsics. While this is a perfectly reasonable starting point, it lacks the elegance and formal semantics of a standalone DSL. This could present challenges for future extension and formal analysis. This is more of an observation on the design trade-offs than a critical flaw.
Questions to Address In Rebuttal
-
Regarding the 1.17x geomean speedup over baseline CUDA: Can the authors provide more insight into the source of this speedup? Is it due to fundamentally better code generation strategies in HPVM-HDC, or is it more a result of different (and perhaps better) tuning choices made during the implementation of the HDC++ versions compared to the available baselines? Clarifying this would help contextualize the performance claims.
-
The paper presents an excellent first step in targeting custom accelerators. Could the authors comment on the extensibility of the HPVM-HDC compiler framework? Specifically, how would the compilation strategy need to adapt if a new HDC accelerator were introduced that did not have a coarse-grained "inference" or "training" instruction, but instead exposed lower-level primitives like "parallel XOR" or "bundled addition"? Does the IR have the capacity to target such an architecture effectively?
-
The HDC++ language introduces a set of 24 primitives (Table 1, page 6). How was this set of primitives chosen? What is the anticipated process for extending the language and compiler should new, fundamental HDC operations emerge from algorithm research in the future?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper presents HPVM-HDC, a programming system comprising a C++-based language (HDC++) and a compiler for generating code for heterogeneous targets, including CPUs, GPUs, and two specific Hyperdimensional Computing (HDC) accelerators (a digital ASIC and a ReRAM-based design). The authors claim this is the first system to enable productive, portable development of HDC applications across such a diverse set of hardware. The core novel claim rests on the creation of a complete, retargetable compilation framework for the HDC domain, moving beyond existing CPU-only compilers or Python-based libraries.
My assessment is that the central contribution—the integrated, heterogeneous compilation system for HDC—is indeed novel. While individual components build upon existing concepts (HDC primitives, compiler infrastructures, approximation techniques), their synthesis into a unified framework that can target custom HDC hardware is a new and significant contribution to the field.
Strengths
-
A Genuinely Novel System for a Niche Domain: The primary strength of this work is its novelty in addressing a clear gap. The prior art in HDC programming tools consists of:
- Libraries: TorchHD [19], OpenHD [30], etc., are Python libraries, not compilers. They lack a formal intermediate representation (IR) and the ability to be retargeted to novel hardware architectures like the ASICs presented here. They are bound by the backend of their host framework (e.g., PyTorch).
- Domain-Specific Compilers with Limited Scope: The authors correctly identify HDCC [61] as the closest prior art. However, HDCC is a source-to-source compiler targeting only multi-core CPUs with POSIX threads. The "delta" here is substantial: HPVM-HDC introduces a proper compiler IR and targets a truly heterogeneous set of devices, including GPUs and, most importantly, custom accelerators. This leap from a single-target compiler to a multi-target, hardware-agnostic framework is the paper's core innovation.
-
First Demonstration of Application Portability to Custom HDC Hardware: A key element of the novelty is the successful targeting of the digital ASIC and the ReRAM accelerator simulator (Section 5.2, page 10). To my knowledge, no prior work has demonstrated a high-level programming language and compiler capable of generating code for these specific, research-grade accelerators from the same source code that runs on a CPU/GPU. This is a "zero to one" contribution in terms of enabling software for this class of hardware.
-
Novel Integration of Domain-Specific Optimizations: While the concepts of binarization and skipping computations (perforation) are not new in the broader context of machine learning acceleration (they are analogous to quantization and pruning), their implementation as automated, IR-level compiler transformations within an HDC-specific framework is novel. Section 4.2 describes a formal approach to applying these approximations that goes beyond ad-hoc, manual code changes seen in prior HDC application papers.
Weaknesses
-
Reliance on an Existing Compiler Framework: The novelty is somewhat circumscribed by the fact that the system is built upon the existing HPVM compiler infrastructure [13]. The paper is transparent about this, but it's crucial to state that the novelty is not in the compiler's core architecture (the hierarchical dataflow graph representation) but in its specialization for the HDC domain. The contribution is a new front-end (HDC++) and new back-ends, not a new compiler paradigm.
-
Primitives are Not Fundamentally New: The HDC++ language itself is primarily a collection of primitives (Table 1, page 6) that encapsulate well-known HDC operations (e.g., random projection, Hamming distance). The language design does not introduce novel concepts for expressing parallelism or computation beyond what is necessary to represent these operations for the compiler. Its novelty is functional—as an effective interface to the compiler—rather than conceptual.
-
Coarse-Grained Accelerator Interface: The novelty of the accelerator targeting is slightly diminished by the coarse-grained nature of the interface (encoding_loop, training_loop, inference_loop, as described in Section 4.3, page 8). This approach maps large, monolithic stages of an algorithm to the hardware. While practical and effective for the specific accelerators targeted, it represents a less sophisticated compilation strategy than, for example, generating a custom instruction stream from finer-grained primitives. The novelty is in making the connection, not necessarily in the sophistication of the code generation for the accelerators themselves.
Questions to Address In Rebuttal
-
Distinction from General-Purpose ML Compilers: Extensible compiler frameworks like MLIR and TVM are explicitly designed to support new domains and hardware targets through new "dialects" and lowering passes. What is the fundamental novelty of the HPVM-HDC approach that could not have been achieved by defining an "HDC dialect" in MLIR? Is the contribution primarily an engineering one based on the authors' familiarity with HPVM, or is there a core architectural reason why HPVM is uniquely suited for this task in a way that represents a novel compilation strategy?
-
Generalizability of the Accelerator Backend: The paper demonstrates targeting two specific accelerators with similar coarse-grained programming models. How general is this backend design? If a third party designed a new HDC accelerator with a fundamentally different, finer-grained instruction set (e.g., focusing on permutation operations), would the HPVM-HDC framework require a complete backend rewrite, or are there novel, reusable components in your accelerator code generation strategy?
-
Novelty of the IR Extensions: The paper states that HDC primitives are represented as LLVM intrinsics within the HPVM-HDC IR (Section 4.1, page 5). Beyond simply adding new intrinsics, were any novel extensions or modifications required for the HPVM dataflow graph IR itself to efficiently represent the unique data dependencies or computational patterns found in HDC (e.g., the manipulation of entire hypervectors as atomic units)? If not, it would suggest the novelty lies more in the front-end and back-end rather than the core representation.
Synchronization for Fault-Tolerant Quantum Computers
Abstract
Quantum Error Correction (QEC) codes store information reliably in logical qubits by encoding them in a larger number of less reliable qubits. The surface code, known for its high resilience to physical errors, is a leading candidate for fault-tolerant ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors address the problem of desynchronization between logical qubits in fault-tolerant quantum computing systems, a scenario that arises from heterogeneous code usage, fabrication defects, or other sources of non-uniform syndrome cycle times. They propose and evaluate three synchronization policies: a baseline 'Passive' policy where the leading qubit idles; an 'Active' policy that distributes this idle time across multiple error correction cycles; and a 'Hybrid' policy that combines the Active approach with running additional error correction rounds. Through simulation and small-scale hardware experiments on physical qubits, the authors claim that their Active and Hybrid policies significantly reduce the logical error rate (LER) by up to 2.4x and 3.4x, respectively, compared to the Passive baseline. They further claim this LER reduction translates to a decoding latency speedup of up to 2.2x.
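For concreteness, the claimed benefit of the Active policy can be illustrated with a toy model; the quadratic idle-error cost, the slack value, and the cycle count below are assumptions for illustration, not parameters taken from the paper.

```python
# A minimal sketch (not from the paper) contrasting the Passive and Active
# policies summarized above. Assumption: the error accumulated in a single
# idle period grows super-linearly with its duration, which is the intuition
# behind distributing the slack; the quadratic form and all numbers below are
# illustrative only.

def idle_error(t_idle_ns: float) -> float:
    """Toy error cost of one uninterrupted idle period."""
    return (t_idle_ns / 1e5) ** 2

def passive_error(slack_ns: float) -> float:
    # Passive: the leading patch absorbs the whole slack as one long idle.
    return idle_error(slack_ns)

def active_error(slack_ns: float, cycles: int) -> float:
    # Active: the slack is split evenly across several syndrome cycles.
    chunk = slack_ns / cycles
    return cycles * idle_error(chunk)

slack, cycles = 4000.0, 10
print(passive_error(slack))          # 0.0016
print(active_error(slack, cycles))   # 0.00016 -- 10x lower under this toy model
```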
Strengths
-
Problem Motivation: The paper correctly identifies that desynchronization will be a necessary consideration in future large-scale, heterogeneous FTQC systems. The motivation provided in Section 1 and Section 3.2 is sound and highlights a practical systems-level challenge.
-
Intuitive Core Idea: The central concept of the 'Active' policy—distributing a long, error-prone idle period into smaller, less damaging segments—is physically intuitive. The supporting experiment on IBM hardware (Figure 6, page 6) demonstrates this principle effectively at the physical qubit level.
-
Comprehensive Policy Space: The paper defines a clear and logical set of policies, progressing from the simplest baseline (Passive) to more complex optimizations (Active, Hybrid), which provides a structured framework for the analysis.
Weaknesses
My primary concerns with this submission relate to the oversimplification of the error model, a significant logical leap between physical hardware experiments and logical qubit claims, and the unsubstantiated generality of the performance results.
-
Insufficiently Realistic Error Model: The paper's central claims rest on the magnitude of the benefit from mitigating idling errors. However, the error model used for simulations (Section 6, page 9) is a standard, uncorrelated Pauli twirl model. This model is known to be optimistic and ignores several critical, real-world effects that the authors' own cited literature [1, 2] has shown to be dominant sources of error. Specifically:
- Leakage: The model does not account for qubit leakage, which is exacerbated by measurement and reset operations fundamental to the surface code cycle. Leakage errors are not simple Pauli errors and can propagate in complex ways.
- Correlated Errors & Crosstalk: The model treats errors as independent events on each qubit. In reality, idling qubits are subject to crosstalk from neighboring active qubits, leading to spatially and temporally correlated errors that are far more challenging for the surface code decoder to handle. The benefit of the Active policy, which interleaves short idles with gate activity, could be significantly diminished or even negated by increased exposure to crosstalk.
- The claim that the model is "conservative" (page 9) is unsubstantiated. A model that ignores known dominant error mechanisms is not conservative; it is incomplete.
-
Unjustified Generalization from Physical to Logical Qubits: The authors present experiments on IBM hardware (Figure 1c, Figure 6c) as evidence for their approach. However, these experiments are performed on isolated physical qubits. It is a profound and unsupported leap to assume that the percentage benefit observed for a simple DD sequence on a single physical qubit will translate directly to a complex, multi-qubit entangled state like a surface code patch. The error dynamics of an idling logical qubit, where both data and measure qubits are susceptible to decoherence and crosstalk within a repeating cycle of measurements, are fundamentally different. The paper provides no bridge—either theoretical or through more sophisticated simulation—to justify this crucial generalization.
-
Arbitrary Parameterization of the Hybrid Policy: The Hybrid policy's performance is critically dependent on the choice of the "slack tolerance" epsilon (Equation 2, page 7). The authors state, "we use a larger value of e = 400ns for all evaluations" (Section 4.2.1, page 8). This value appears arbitrary and fine-tuned. There is no methodology presented for how one would determine the optimal epsilon in a real system, nor is there a sensitivity analysis showing how the claimed 3.4x LER reduction (Table 4, page 11) varies with different choices of epsilon. Without this, the Hybrid policy appears to be a brittle optimization rather than a robust protocol.
-
Inconsistency Between Motivation and Evaluation: The paper is strongly motivated by the need for synchronization in heterogeneous systems using different QEC codes (e.g., surface, color, qLDPC, as shown in Figure 1a). However, the core evaluations and simulations "restrict our evaluations to only surface code patches" (Section 6, page 9). While the authors justify this by decoder availability, it creates a disconnect. The primary claims of the paper are not evaluated in the very context used to motivate their importance. The evaluation of Tp != Tp' with two surface code patches of slightly different cycle times does not capture the full complexity of synchronizing fundamentally different codes.
-
Secondary Claim on Decoding Speedup is Confounded: The claimed decoding speedup of up to 2.2x (Figure 22, page 12) is not a direct result of the synchronization policy but an indirect artifact of the specific hierarchical decoder architecture (LUT + MWPM) assumed for the analysis. The speedup is attributed to a higher LUT hit rate due to a lower error rate. This result is not generalizable. A different decoder architecture, such as a fast monolithic belief-propagation decoder or a neural network decoder, may not exhibit this LUT/miss behavior, and thus the performance benefit would not materialize. The claim should be more cautiously framed as being specific to this class of decoders.
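For concreteness, the dependence on the decoder class in the last point can be sketched with a simple expected-latency model; all latencies and hit rates below are hypothetical, chosen only to show why the speedup is tied to the hierarchical LUT + MWPM structure.

```python
# A back-of-the-envelope sketch of the hierarchical-decoder argument above.
# Assumption: a two-level LUT + MWPM decoder in which easy syndromes hit the
# lookup table and misses fall through to the slow matching decoder.

def mean_decode_latency_ns(lut_hit_rate: float,
                           lut_latency_ns: float = 10.0,
                           mwpm_latency_ns: float = 1000.0) -> float:
    """Expected per-round latency of a two-level decoder."""
    return lut_hit_rate * lut_latency_ns + (1.0 - lut_hit_rate) * mwpm_latency_ns

# Fewer idling errors -> sparser syndromes -> higher LUT hit rate.
print(mean_decode_latency_ns(0.80))   # 208.0 ns
print(mean_decode_latency_ns(0.95))   # 59.5 ns (~3.5x faster in this toy model)
# A monolithic decoder with a single fixed latency would see no such benefit.
```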
Questions to Address In Rebuttal
-
Please provide a rigorous justification for why a simple, uncorrelated Pauli error model is sufficient for this analysis. Specifically, how would the inclusion of leakage and spatially correlated crosstalk errors be expected to alter the relative performance advantage of the Active policy over the Passive policy?
-
The Hybrid policy's efficacy hinges on the parameter epsilon. Please provide a sensitivity analysis for the LER reduction as a function of epsilon. What is the principled methodology a system designer should use to select an optimal epsilon for a given hardware platform?
-
Please provide a stronger argument to bridge the experimental results on single physical qubits (Figure 6) with the simulation results for logical qubits (Figure 14). Why should the dynamics of mitigating decoherence in these two vastly different regimes be considered directly comparable?
-
The motivation for this work is strongly rooted in heterogeneous architectures, yet the simulations are restricted to homogeneous surface code patches. Please clarify exactly what source of desynchronization (e.g., differing gate times, patch layout) was modeled in the simulations that show Tp != Tp' and justify why this is a sufficient proxy for the broader problem of code-level heterogeneity.
-
Can the authors defend the generality of the 2.2x decoding speedup claim? How would this performance benefit change if a non-hierarchical, high-speed monolithic decoder were used instead of the assumed LUT+MWPM architecture?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper identifies and addresses a critical, yet often overlooked, systems-level challenge for future fault-tolerant quantum computers (FTQC): the desynchronization of logical operations. The authors correctly identify that as FTQC architectures mature, they will likely become heterogeneous, employing different QEC codes for different purposes (e.g., surface codes for compute, qLDPC for memory, color codes for magic states), and will need to tolerate fabrication defects. These factors disrupt the lockstep execution of syndrome extraction cycles across different logical qubits ("patches"), creating a "synchronization slack" that must be resolved before multi-qubit operations like Lattice Surgery can proceed.
The core contribution is the proposal and evaluation of three synchronization policies. The baseline "Passive" policy simply idles the leading logical qubit, which the authors show significantly increases the logical error rate (LER). Their primary proposal, the "Active" policy, elegantly mitigates this by distributing the total idle time into smaller, less-damaging increments across multiple error correction cycles. The "Hybrid" policy further refines this by combining the Active approach with running additional error correction rounds when cycle times differ. The work demonstrates through simulation that these policies can reduce the LER by up to 3.4× and, consequently, improve performance by speeding up decoding by up to 2.2×.
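As a rough illustration of the Hybrid mechanic (under an assumed formulation; the paper's Equation 2 is not reproduced here), the slack can be split into whole extra QEC rounds plus a residual idle that is dropped once it falls below the tolerance epsilon. All numbers below are hypothetical.

```python
# A minimal sketch of the Hybrid idea described above, under an assumed
# formulation: absorb as much of the synchronization slack as possible with
# whole extra QEC rounds, and only keep the leftover as distributed idles if
# it exceeds the slack tolerance epsilon.

def hybrid_split(slack_ns: float, t_cycle_ns: float, epsilon_ns: float):
    """Return (extra_rounds, residual_idle_ns) for a given slack."""
    extra_rounds = int(slack_ns // t_cycle_ns)        # whole rounds that fit in the slack
    residual = slack_ns - extra_rounds * t_cycle_ns   # leftover time
    if residual <= epsilon_ns:                        # within tolerance: ignore it
        residual = 0.0
    return extra_rounds, residual

print(hybrid_split(slack_ns=2300.0, t_cycle_ns=1000.0, epsilon_ns=400.0))  # (2, 0.0)
print(hybrid_split(slack_ns=2600.0, t_cycle_ns=1000.0, epsilon_ns=400.0))  # (2, 600.0)
```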
Strengths
-
Excellent Problem Formulation and Motivation: The paper’s primary strength is its clear articulation of a crucial engineering problem that lies at the intersection of QEC theory, compiler design, and hardware architecture. By framing the issue as one of "logical clocks" and providing concrete, forward-looking motivations (heterogeneous codes in Section 3.2.1, page 4; dropouts in Section 3.2.2, page 4), the authors make a compelling case for why synchronization is not a niche issue but a fundamental requirement for scalable FTQC.
-
Elegant and Intuitive Core Idea: The proposed "Active" synchronization policy is a simple, powerful, and physically well-motivated idea. The insight that multiple short idle periods are less harmful than one long one is intuitive, and the paper does an excellent job of quantifying this intuition through both small-scale hardware experiments (Figure 6, page 6) and large-scale simulations. This simplicity makes the proposed solution highly practical.
-
Connecting to the Broader System: A standout feature of this work is its ability to connect the proposed low-level synchronization policy to higher-level system performance. The analysis in Section 7.5 (page 12), which links the LER reduction from the Active policy to a tangible speedup in a hierarchical decoder, is an exemplary piece of systems-level thinking. It shows that the benefits are not merely academic but could translate into faster, more efficient quantum computation.
-
Contextualization within Architectural Trends: This work is perfectly timed. The quantum computing community is actively exploring beyond monolithic, homogeneous surface code architectures. This paper provides a key piece of the puzzle for making proposed heterogeneous systems [9, 80] and defect-tolerant layouts [24, 74] viable in practice by providing a mechanism to manage the temporal inconsistencies they inevitably create. It essentially provides the "temporal glue" for these advanced architectural concepts.
Weaknesses
While the work is strong, its potential could be further enhanced by addressing the following points, which are less criticisms than areas for future exploration:
-
Simplified Error Model: The analysis relies on a standard, but simplified, circuit-level noise model with depolarizing and Pauli twirled idling errors (Section 6, page 9). While a necessary starting point, real-world systems will feature more complex error mechanisms, such as leakage and correlated crosstalk, which may be exacerbated by the start-stop nature of the Active policy. The impact of such correlated errors on the proposed policies remains an open question.
-
Scalability in Complex Parallel Workloads: The proposed method for synchronizing k-patches by aligning all to the slowest patch (Section 4.3, page 8) is a sensible heuristic. However, in a complex algorithm with high parallelism, there may be many independent groups of patches requiring synchronization simultaneously. This could create cascading dependencies or scheduling bottlenecks that are not captured by the two-patch analysis. The constant-time claim holds for a single operation but the system-wide impact is less clear.
-
Control System Overhead: The paper proposes a microarchitecture for managing synchronization at runtime (Figure 12, page 8) and plausibly argues it sits outside the critical path. However, the practical complexity and resource cost (e.g., in terms of classical logic, memory, and power) of tracking the phase of thousands or millions of logical patches in real-time is non-trivial and warrants a more detailed analysis.
Questions to Address In Rebuttal
-
The Hybrid policy's effectiveness depends on the choice of the slack tolerance ε. As discussed in Section 4.2.1 (page 8), this seems to be a system-dependent hyperparameter. Could the authors elaborate on how one might determine an optimal ε in practice? Would this require extensive characterization and calibration of a given QPU's noise profile, or are there more general heuristics?
-
The core assumption of the Active policy is that the harm of idling is super-linear with time, making division beneficial. While true for decoherence, could the repeated stopping and starting of the syndrome cycle (i.e., inserting idle gates) introduce other error modes not captured in the current model, such as those related to control signal transients or state-dependent crosstalk, that might diminish its benefits?
-
Regarding scalability, can you comment on the potential interaction between the synchronization policy and the logical algorithm scheduler? For instance, could a scheduler, aware of the cost of synchronization, reorder operations to group patches with similar "logical clock speeds" to minimize the required slack, thereby working synergistically with your proposed policies? This seems like a promising avenue for cross-layer optimization.
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper addresses the problem of logical clock desynchronization in fault-tolerant quantum computers (FTQC), a systemic issue arising from heterogeneous QEC codes, fabrication defects, or other sources of timing variability. The authors isolate and formalize this problem, which has been an implicit challenge but not a direct subject of prior architectural studies. The core of the paper is the proposal and evaluation of three distinct synchronization policies:
- Passive: A baseline policy where the leading logical qubit idles, waiting for the lagging qubit to catch up before a joint operation like Lattice Surgery.
- Active: The primary novel proposal, where the total synchronization slack is broken into smaller chunks and distributed as short idle periods between multiple syndrome extraction cycles of the leading qubit.
- Hybrid: An extension of the Active policy for cases where cycle times differ, which combines distributed idling with running a calculated number of additional error correction rounds to minimize both idling and computational overhead.
The authors evaluate these policies via simulation, demonstrating that the Active and Hybrid policies significantly reduce the logical error rate (LER) compared to the naive Passive approach.
Strengths
The primary strength of this work lies in its novelty, which can be broken down into two distinct components:
-
Problem Formalization: While the causes of desynchronization (e.g., different cycle times for different codes [11, 80], or defects [24]) are known, this paper is the first I am aware of to treat the mechanism of resynchronization as a first-class architectural problem. It moves beyond acknowledging the need for synchronization barriers and proposes concrete, evaluatable policies for implementing them.
-
The 'Active' and 'Hybrid' Policies: The core conceptual contribution—the Active policy—is genuinely novel in the context of FTQC architecture. The insight is to not treat the synchronization slack as a monolithic block of idle time but to "amortize" its detrimental effects by interleaving it with computation. This is a clear departure from the baseline "wait" approach. While conceptually analogous to "race-to-the-deadline" power management techniques in classical processors, its application to mitigate decoherence errors in a quantum system with a completely different cost model (errors vs. energy) is a new and significant contribution. The Hybrid policy is a logical and also novel extension, creating a trade-off space between idling and running extra QEC rounds, which has not been previously proposed.
Weaknesses
From the perspective of novelty, the weaknesses are minor and relate more to the boundaries of the contribution rather than a fundamental lack of new ideas.
-
Obviousness of the Baseline: The "Passive" policy is the default, trivial solution that any system designer would first consider. While necessary for establishing a baseline, it holds no novelty itself. The paper's contribution rests entirely on the improvements offered by the Active and Hybrid policies over this strawman.
-
Incremental Nature of the 'Hybrid' Policy: While novel in its formulation for this problem, the Hybrid policy is an incremental optimization on top of the core Active policy. It combines the Active policy's insight with the well-understood principle of trading time for computation (running more QEC cycles). The novelty lies in the specific synthesis and application, not in the constituent parts.
Questions to Address In Rebuttal
The authors should address the following points to further solidify the novelty of their contribution:
-
Relation to Classical Systems: The "Active" policy is conceptually similar to Dynamic Voltage and Frequency Scaling (DVFS) in classical CPUs, where a processor slows down to meet a deadline exactly, avoiding a high-power "race-to-idle." Can the authors elaborate on the fundamental differences in applying this concept to a quantum system, where the cost of idling is not wasted energy but an increased probability of uncorrectable state corruption? This would help frame the novelty beyond a simple porting of a classical idea.
-
Prior Art in FTQC Compilers: The scheduling of logical operations is a key task for an FTQC compiler stack. Can the authors confirm that no prior work on Lattice Surgery compilation (e.g., [52], [90]) has proposed or implicitly implemented a scheme for distributing synchronization slack between QEC rounds? While I am not aware of any, a definitive statement would strengthen the claim of novelty.
-
The 'Active-intra' Policy: In Section 4.1.3 (Page 6), the paper introduces the "Active-intra" policy and then demonstrates its inferiority (Figure 17, Page 10). While this is scientifically sound, what is the novelty of this specific variant? Is this a known (but unevaluated) idea, or is it also being proposed for the first time here simply to be refuted? Clarifying its origin would be helpful.
SWIPER: Minimizing Fault-Tolerant Quantum Program Latency via Speculative Window Decoding
Abstract
Real- time decoding is a key ingredient in future fault-tolerant quantum systems, yet many decoders are too slow to run in real time. Prior work has shown that parallel window decoding can scalably meet throughput requirements in the presence of increasing ...
Reviews
Review 1
Review Form:
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose SWIPER, a technique to reduce quantum program latency by introducing speculative execution into the parallel window decoding process for surface codes. The core mechanism is a lightweight, heuristic-based predictor that anticipates the data dependencies (i.e., Pauli corrections) across window boundaries. This allows dependent decoding windows to begin execution before their predecessors have fully completed, thereby reducing idle time, particularly for time-sensitive blocking operations like T-gate teleportations. The authors introduce a new simulator, SWIPER-SIM, to evaluate this approach on a suite of benchmarks, claiming an average 40% reduction in program runtime compared to non-speculative parallel window decoding, at the cost of a ~30% increase in the number of concurrent classical decoders.
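For intuition, the expected benefit of speculation can be sketched with a simple two-window timing model; the latencies, prediction accuracy, and rollback cost below are hypothetical, not SWIPER's measured values.

```python
# An illustrative timing sketch (assumptions only, not SWIPER's model) of why
# speculating across window boundaries shortens the critical path: without
# speculation, a dependent window waits a full decoder latency t_w for its
# predecessor; with speculation it starts immediately and only pays a rollback
# plus re-decode penalty on a misprediction.

def expected_two_window_latency(t_w: float, p_correct: float, t_rollback: float) -> float:
    """Expected time to finish two dependent windows under speculation."""
    speculative_hit = t_w                      # both windows decode in parallel
    speculative_miss = t_w + t_rollback + t_w  # detect, roll back, re-decode dependent window
    return p_correct * speculative_hit + (1.0 - p_correct) * speculative_miss

t_w = 1.0                                          # normalized decoder latency (hypothetical)
print(2.0 * t_w)                                   # 2.0   -- non-speculative baseline
print(expected_two_window_latency(t_w, 0.95, 0.1)) # 1.055 -- speculation wins on average
```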
Strengths
- The paper identifies a legitimate and subtle performance bottleneck in parallel window decoding schemes: the "alignment" of blocking operations with source/sink window boundaries (Section 5.2.1, page 9). The introduction of an "aligned window strategy" is a concrete contribution in its own right.
- The development of the SWIPER-SIM tool appears to be a substantial engineering effort. A round-level simulator capable of modeling program execution with different windowing strategies is a valuable tool for architectural exploration in this domain.
- The core idea of applying speculation, a well-established concept from classical computer architecture, to the problem of decoding is logical and well-motivated by the sequential dependencies present in windowed decoding.
Weaknesses
My primary concerns with this work center on the robustness of the core speculative mechanism and the potential for understated costs and overstated benefits.
-
Fragility of the Predictor: The heart of SWIPER is the 3-step predictor (Section 4.2). This is a greedy, local heuristic designed to find low-weight matchings across a boundary. Its performance is evaluated against what appears to be a standard, uncorrelated depolarizing error model (p=10⁻³ mentioned in Section 3.1). Quantum error correction, however, must be robust to all possible errors, including less frequent but more complex, non-local error chains. The predictor's effectiveness on such error patterns is not demonstrated. The sharp decline in accuracy for d > 25 (Figure 4, page 6) is alarming and suggests the heuristic's assumptions break down at scales relevant for challenging computations. The paper provides no analysis of the predictor's performance under more adversarial conditions, such as correlated noise, which could severely degrade accuracy and negate SWIPER's benefits.
-
Underestimated Cost of Misprediction: The paper advocates for an "optimistic" recovery strategy (Section 4.3.1, page 7), arguing that error propagation between adjacent boundaries is minimal. However, the final benchmark results in Figure 14a (page 12) show that the runtime overhead from recovering from these "minimal" mispredictions (the lighter portion of the bars) constitutes a very significant fraction of the remaining execution time. This finding appears to be in direct tension with the justification for the optimistic strategy. If misprediction recovery is this costly, the entire premise that speculation is a net win becomes questionable. Furthermore, the cost of a "Pauli frame rollback" is mentioned but not quantified; in a real hardware controller, this is a non-trivial state management operation that adds complexity and potential latency not accounted for in the simulation.
-
Questionable Baseline and Scope of Claims: The performance improvements are benchmarked against decoder latencies derived from PyMatching, a software implementation (Figure 3, page 5). As the authors acknowledge in their related work section, significant effort is being directed toward hardware-accelerated decoders (e.g., Helios [43], Micro Blossom [63]) which can achieve latencies orders of magnitude lower than software decoders for a given code distance. The entire benefit of SWIPER exists within the time gap tw (decoder latency). If tw is drastically reduced by specialized hardware, the window of opportunity for speculation shrinks or vanishes entirely, making the 40% improvement figure highly dependent on a potentially pessimistic assumption about future decoder hardware.
-
Inconsistent and Incomplete Resource Analysis: The abstract claims a "consistent 31% increase in the number of concurrent decoders." However, the analysis in Section 5.3.5 (page 12) and the corresponding Figure 15a states, "SWIPER uses approximately 24% more simultaneous decoding processors". This is a direct numerical contradiction that undermines confidence in the analysis. Moreover, the overhead analysis is incomplete. It focuses solely on the count of decoder units, neglecting the non-trivial hardware cost of the predictor itself (analyzed in Figure 7), the verification logic, and the significantly more complex control plane required to manage speculative states, issue rollbacks, and handle the entire speculative lifecycle. The true classical resource overhead is certainly higher than the reported 24-31% figure.
Questions to Address In Rebuttal
- Predictor Robustness: Can you provide data on the predictor's accuracy (per Section 4.2) under a correlated noise model or against higher-weight error chains? Please provide a technical justification for the precipitous drop in accuracy for d > 25 shown in Figure 4. Is this a fundamental limitation of local heuristics for larger codes?
- Misprediction Cost Contradiction: Please reconcile the advocacy for an optimistic recovery strategy (Section 4.3.1) with the empirical result that misprediction overhead forms a large part of the final program runtime (Figure 14a). What is the simulated cost, in cycles, of a single misprediction event, including detection, pipeline flush, Pauli frame rollback, and restart?
- Resource Overhead Discrepancy: Please clarify the discrepancy between the 31% (Abstract) and 24% (Section 5.3.5) figures for the increase in required decoder units. Can you provide a more comprehensive analysis of the total classical hardware overhead, moving beyond just the decoder count to include the predictor, verifier, and control logic, perhaps as a percentage of the baseline parallel decoder system's total area/power?
- Sensitivity to Baseline Decoder Latency: The claimed 40% improvement is contingent on the baseline decoder latency (tw). How do SWIPER's performance benefits degrade as tw is reduced by an order of magnitude or more, as might be expected from future dedicated hardware decoders? At what value of tw (relative to window generation time) do the benefits of SWIPER become marginal?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces SWIPER, a novel architecture for quantum error correction decoding that elegantly applies the classical computer architecture principle of speculative execution to the problem of parallel window decoding. The authors identify that while parallel windowing solves the throughput problem for slow decoders, it creates latency bottlenecks due to data dependencies between windows, particularly for blocking operations like T-gates. SWIPER addresses this by using a lightweight, hardware-friendly predictor to speculate on the outcomes of these dependencies, allowing dependent decoding tasks to begin early. If the speculation is correct, program latency is significantly reduced; if incorrect, the system reverts to a state no worse than the non-speculative baseline. Through extensive simulation using a custom, program-level simulator (SWIPER-SIM), the authors demonstrate that SWIPER can reduce the runtime of fault-tolerant quantum benchmarks by an average of 40%, a remarkable improvement.
Strengths
The primary strength of this work is its brilliant cross-pollination of ideas, importing a cornerstone concept from classical computer architecture—speculative execution—into the domain of quantum error correction. This is not merely an application of an old idea but a deep insight into the structural nature of the decoding problem. It represents a significant step forward in architecting practical fault-tolerant systems.
-
A Powerful Conceptual Bridge: The authors have successfully framed the inter-window data dependency problem in a way that makes it amenable to classical speculation techniques. This connection between QEC decoding pipelines and classical processor pipelines is a powerful and generative one that could inspire further research at this intersection.
-
Shifting the Metric of Success: This work helps solidify a crucial and timely paradigm shift in the field. For years, the community has focused intensely on reducing the absolute latency of a single "inner" decoding operation to below the ~1µs syndrome cycle time. SWIPER, building on parallel windowing, compellingly argues that for overall program execution, the reaction time of blocking operations is a more meaningful metric. This frees designers from the 1µs straightjacket and opens the door to a richer design space.
-
Enabling Slower, More Powerful Decoders: A significant implication, highlighted well in Figure 12b, is that SWIPER provides substantial headroom (2-5x) in the latency requirements for the inner decoder. This is perhaps the paper's most impactful long-term contribution. It makes the use of more sophisticated but slower decoders—such as those based on neural networks [7] or tensor networks [13]—a practical possibility. Trading this newfound latency budget for higher decoding accuracy could lead to a net improvement in logical qubit fidelity, a critical goal for the entire field.
-
High-Quality Systems-Level Simulation and Analysis: The creation of SWIPER-SIM is a significant contribution in its own right. This tool enabled the authors to move beyond single-window analysis and evaluate the end-to-end impact on real quantum programs. The discovery of the "T-gate alignment" issue (Section 5.2.1, page 9) is a testament to the power of this systems-level approach and a valuable, practical finding for the community.
-
Compelling and Well-Supported Results: A 40% reduction in program runtime is a headline result that will capture the community's attention. The claim is well-supported by a comprehensive evaluation across a diverse set of benchmarks, a detailed analysis of misprediction overhead, and a realistic accounting of the required classical hardware resources.
Weaknesses
The weaknesses of the paper are minor in comparison to its strengths and largely represent opportunities for deeper exploration rather than fundamental flaws.
-
Heuristic Nature of the Predictor: The 3-step predictor is pragmatic and demonstrably effective. However, its design feels somewhat ad-hoc. The work would be strengthened by a more formal discussion of the trade-off between predictor complexity, hardware cost, and accuracy. It is unclear if there is a theoretical ceiling on the accuracy of such lightweight, local predictors.
-
Generalizability Beyond the Surface Code: The work is entirely situated within the context of the surface code. While this is the leading candidate code, the principles of windowed decoding and speculation are more general. A discussion on how these ideas might apply to other code families, such as the qLDPC codes mentioned in the conclusion, would broaden the perceived impact of the work. The local, sparse nature of surface code errors is key to the success of both the predictor and the "optimistic" recovery strategy; it's an open question how this would translate to codes with different structures.
-
Analysis of Wasted Work: The paper quantifies the required increase in concurrent classical decoders well (Figure 15). However, a more direct analysis of the total "wasted" classical compute cycles due to mispredictions would provide a fuller picture of the overhead. While classical computation is "cheaper" than quantum computation, it is not free, and understanding this cost is important for system designers.
Questions to Address In Rebuttal
-
The potential to use the gained latency headroom for more accurate decoders is a key selling point. Can the authors provide a back-of-the-envelope calculation to make this more concrete? For example, if the 2-5x latency budget (from Figure 12b) were used for an RNN decoder [7] that offers a hypothetical 2x reduction in logical error rate, what is the net effect on the overall Space-Time Volume of a benchmark application?
-
The optimistic recovery strategy is well-justified by the locality of errors in the surface code under standard depolarizing noise. How do the authors expect SWIPER's performance (both prediction accuracy and misprediction recovery cost) to change in the presence of spatially or temporally correlated noise, where an error event might have non-local consequences that violate the assumptions of the optimistic strategy?
-
The predictor's accuracy is evaluated at a physical error rate of p=10⁻³. As physical qubit quality improves and this rate drops, one would expect error instances to become sparser and simpler. This should, in principle, make the lightweight predictor even more effective. Have the authors considered how prediction accuracy and the overall runtime improvement from SWIPER might scale as the physical error rate improves?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper introduces SWIPER, a speculative window decoder for fault-tolerant quantum error correction (QEC). The core idea is to apply the principle of speculative execution, a concept borrowed from classical computer architecture, to the problem of resolving data dependencies between adjacent decoding windows. Instead of waiting for a predecessor window to be fully decoded, a dependent window begins its own decoding process speculatively, using a predicted set of boundary conditions generated by a novel, lightweight predictor. Once the predecessor window completes, the speculation is verified. In case of a misprediction, the speculative work is discarded and restarted—a rollback mechanism. The authors claim this approach reduces the critical path latency of blocking operations, such as T gates, thereby decreasing overall quantum program runtime.
Strengths
The primary strength of this paper is the high degree of novelty in its central claim. My analysis confirms that the core concept presented is genuinely new to the field of QEC decoding.
-
Novel Application of a Cross-Domain Concept: The authors explicitly state their inspiration from classical branch prediction (Abstract, page 1). While speculation is a cornerstone of high-performance classical processors, its application to break data dependencies between pipelined QEC decoding windows has not been proposed before. Prior work on parallel window decoding (e.g., Skoric et al. [55], Tan et al. [56]) establishes a rigid, non-speculative dependency graph. SWIPER’s contribution is to fundamentally challenge this paradigm by introducing a speculative data flow, which is a significant conceptual leap.
-
Well-Defined Delta from Prior Art: The paper clearly articulates its novelty against the closest related works. In Section 7 (Related Work, page 13), the authors correctly identify that prior parallel windowing schemes "all assume that windows with dependencies wait until their dependencies are completely decoded." SWIPER’s introduction of a speculation step is the key differentiator that enables the overlapping of dependent tasks. This is a clear and defensible "delta."
-
Novel Supporting Mechanisms: The contribution is not merely a high-level idea. The authors propose a concrete and novel implementation: a 3-step, O(1) runtime predictor (Section 4.2, page 5-6). This predictor is itself a novel piece of engineering designed specifically for the surface code boundary problem. Furthermore, the analysis of misprediction handling strategies (pessimistic vs. optimistic, Section 4.3.1, page 7) and the identification of the "T Gate Alignment" problem (Section 5.2.1, page 9) are novel insights that stem directly from exploring their new speculative framework.
Weaknesses
While the core concept is novel, the novelty of some of the constituent parts could be further contextualized.
-
Incremental Novelty of the Predictor: The predictor design, while effective, is a heuristic based on well-understood properties of the surface code (i.e., that most error chains are short and low-weight). It is a clever construction but does not represent a fundamentally new approach to error prediction in the vein of, for instance, a new machine learning model. The paper’s contribution lies in identifying the need for such a predictor and showing a simple one is sufficient, but the predictor's algorithmic novelty is limited.
-
Unexplored Design Space: The paper presents one specific instance of a speculative decoder. The novelty could be strengthened by discussing alternative implementations. For example, could an intermediate state from an iterative decoder (like belief propagation or sparse blossom [34]) provide a "good enough" prediction without a separate predictor module? The proposed mechanism introduces significant new complexity (predictor logic, verifiers, rollback on the Pauli frame unit), and it is not yet clear if this specific design point is optimal.
Questions to Address In Rebuttal
-
Verification Latency and Overhead: The speculative pipeline relies on a "Verify Speculation" step after the predecessor window completes (Figure 1b, page 1). The paper's performance model does not appear to account for the latency of this verification step itself (i.e., the time to compare the predicted boundary bits with the actual results and signal a potential rollback). Can the authors quantify this latency and explain how it impacts the critical path? If verification is non-trivial, it could erode some of the claimed latency benefits.
-
Conceptual Distinction from Hierarchical Decoding: Can the authors further elaborate on the conceptual distinction between SWIPER and hierarchical/multi-level decoding schemes? For instance, works like Clique [48] and Promatch [3] use a lightweight first-pass decoder to handle easy cases within a single window before invoking a full decoder. One could frame SWIPER as a two-level scheme where a fast, approximate decoder (the predictor) runs on the boundary, followed by a full decoder on the window. Please clarify why speculating on inter-window dependencies is a fundamentally different and more novel approach than accelerating intra-window decoding with a pre-decoder.
-
Predictor Generality: The 3-step predictor is tailored to the constant-weight parity checks of the surface code, leading to its O(1) runtime (Key Insight, page 6). How does the core idea of SWIPER generalize to codes without this convenient structure, such as qLDPC codes, where the predictor design might be substantially more complex? Does the novelty of SWIPER hinge on the specific structure of the surface code?
CaliQEC: In-situ Qubit Calibration for Surface Code Quantum Error Correction
Abstract
Quantum Error Correction (QEC) is essential for fault-tolerant, large-scale quantum computation. However, error drift in qubits undermines QEC performance during long computations, necessitating frequent calibration. Conventional calibration methods ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present CaliQEC, a framework designed to perform in-situ calibration of physical qubits during the execution of surface code quantum error correction. The central thesis is that error drift, a critical obstacle for long-running quantum computations, can be mitigated without halting the entire computation. The proposed method leverages the theory of code deformation to selectively isolate drifting qubits for calibration, while dynamically enlarging the code patch to maintain the required level of error protection. The framework includes a preparation stage for device characterization, a compile-time scheduling algorithm to group and sequence calibration tasks, and a runtime system that applies the deformation instructions. The authors support their claims with simulations of large-scale applications and experiments on a small d=3 surface code implemented on existing quantum hardware.
Strengths
- The paper addresses a well-recognized and critical problem in fault-tolerant quantum computing. Error drift is a fundamental limitation, and any practical solution would be of significant interest.
- The core proposal to repurpose code deformation, a known theoretical tool for logical operations, for the purpose of dynamic qubit isolation is a valid and interesting line of inquiry.
- The development and formalization of a dedicated instruction set for the heavy-hexagon topology (Section 6, page 7) represents a concrete technical contribution, as this architecture is prevalent in state-of-the-art hardware and cannot use square-lattice instructions directly.
- The error drift model is, at least in part, grounded in measurements from a real quantum device (Fig. 9, page 9), which lends some credibility to the simulation parameters.
Weaknesses
My analysis has identified several areas where the paper's claims are insufficiently substantiated, and the methodology raises significant concerns about the validity and generalizability of the results.
-
The "Logical Swap for Calibration" (LSC) Baseline is a Strawman: The paper's primary quantitative claims hinge on the dramatic outperformance of CaliQEC over the LSC baseline. However, the LSC baseline as described in Section 7.3 (page 10) appears to be non-optimally designed. The assumption of a "roughly 4x qubit overhead" from a "straightforward 2D expansion" seems to represent a worst-case, naive implementation of state swapping. A more sophisticated scheduling system could utilize communication channels more efficiently or employ teleportation-based schemes with different resource trade-offs. By comparing against this simplistic baseline, the claimed 363% qubit overhead reduction by CaliQEC is likely inflated. A rigorous study would compare against a more competitive state-of-the-art method for state relocation.
-
Execution Time Claims Lack Rigor and Plausibility: The central claim of "negligible" execution time overhead is not supported by the evidence. In Table 2 (page 11), CaliQEC is reported to have exactly zero execution time overhead compared to the "No Calibration" baseline across all benchmarks. This is physically implausible. The processes of code deformation, measurement, qubit reintegration, and stabilizer remeasurement all require physical time. While these may run concurrently with computation in other parts of the chip, they must surely impact the QEC cycle time in the affected region, which would propagate to total execution time for any algorithm with data dependencies across the code patch. The paper fails to provide any breakdown of this timing or justify how it can be completely absorbed without penalty.
-
Extrapolation from Small-Scale Experiments is Unjustified: The hardware validation in Section 8.3 (page 12) is performed on a distance d=3 surface code. The primary simulation results in Table 2, however, are for codes with distances ranging from d=25 to d=47. The physics of error propagation, the complexity of decoding, and the potential for correlated errors from deformation operations do not necessarily scale linearly. A demonstration on a toy-sized d=3 code, which has limited error correction capability, provides insufficient evidence to validate claims about performance on large, practical code distances. The logical leap from d=3 to d=47 is substantial and unsupported.
-
Key Model Assumptions are Not Adequately Justified:
- Error Drift Model: The authors adopt an exponential drift model (Eq. 1, page 5), while acknowledging that "some references report a linear drift model." The choice of an exponential model, which shows faster degradation, could make the need for frequent calibration appear more urgent, thereby favoring their solution. No sensitivity analysis is provided to show how the system would perform under a different, potentially more realistic, drift model (see the sketch after this list).
- Crosstalk Characterization: The method for identifying crosstalk-affected qubits nbr(g) (Section 4, page 5) relies on detecting "deviations beyond a threshold." This threshold is a critical hyperparameter that is neither defined nor justified. The size of the isolated region, and thus the entire space-time overhead, is highly sensitive to this value. Without a clear and defensible methodology for setting this threshold, the results cannot be considered robust.
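Regarding the drift-model point above, a minimal sketch (both functional forms and all rates are assumed here, not taken from Eq. 1) shows how strongly the implied calibration urgency depends on the model choice.

```python
# A hedged illustration of why the drift-model choice matters for scheduling:
# under exponential drift a fixed error budget is exhausted sooner than under
# linear drift, so calibration appears more urgent. Rates and budgets are
# hypothetical.

import math

def time_to_budget_exponential(p0: float, rate: float, p_max: float) -> float:
    # p(t) = p0 * exp(rate * t)  =>  t = ln(p_max / p0) / rate
    return math.log(p_max / p0) / rate

def time_to_budget_linear(p0: float, slope: float, p_max: float) -> float:
    # p(t) = p0 + slope * t  =>  t = (p_max - p0) / slope
    return (p_max - p0) / slope

p0, p_max = 1e-3, 2e-3
print(time_to_budget_exponential(p0, rate=1e-2, p_max=p_max))  # ~69.3 time units
print(time_to_budget_linear(p0, slope=1e-5, p_max=p_max))      # 100.0 time units
```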
Questions to Address In Rebuttal
The authors must address the following points directly to establish the credibility of their work:
- Please provide a justification for the LSC baseline. Specifically, why is a "straightforward 2D expansion" considered a fair point of comparison, as opposed to more resource-optimized logical swap protocols described in the literature?
- Provide a detailed, quantitative breakdown of the execution time overhead. How can the sequence of operations required for deformation and calibration (e.g., DataQ_RM, PatchQ_AD) be implemented with precisely zero impact on total program runtime as claimed in Table 2? What is the effect on the local QEC cycle duration?
- How sensitive are your scheduling and overhead results to the choice of an exponential error drift model? Please provide data on how the system would perform if a linear drift model were assumed instead.
- What is the specific, quantitative definition of the "threshold" used to determine crosstalk-affected qubits in Section 4? Please provide a sensitivity analysis showing how the qubit and time overheads change as this threshold is varied.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents CaliQEC, a comprehensive framework for performing in-situ calibration of physical qubits during a running, surface-code-protected quantum computation. The work is motivated by the critical problem of "error drift," where the performance of physical qubits degrades over time, jeopardizing the long computations required for fault-tolerant quantum computing (FTQC). The core technical contribution is the novel repurposing of code deformation, a technique typically used for implementing logical gates, as a mechanism for physical hardware maintenance. By dynamically modifying the surface code structure, CaliQEC can temporarily isolate drifting qubits for calibration and then reintegrate them, all without halting the logical computation. This mechanism is supported by a full-stack, compiler-level approach that includes preparation-time device characterization, drift-based calibration grouping, and an adaptive scheduling algorithm to manage the space-time resource trade-offs. The authors evaluate CaliQEC against two sensible baselines—no calibration and a naive "Logical Swap for Calibration" (LSC) approach—demonstrating through simulation and small-scale hardware experiments that their method can maintain a target logical error rate with minimal qubit and time overhead.
Strengths
-
Addresses a Foundational, System-Level Problem: The most significant strength of this work is that it tackles a problem of fundamental importance to the future of FTQC. While much of the literature focuses on designing better codes or faster decoders under a static noise model, this paper confronts the messy reality of dynamic, time-varying hardware. Error drift is a well-known but often-overlooked roadblock that stands between current NISQ devices and future fault-tolerant machines. By providing a practical solution, this work bridges a crucial gap between QEC theory and experimental reality. The analogy drawn to DRAM refresh in classical computing (Section 1, page 2) is particularly apt and effectively frames the problem for the broader computer architecture community.
-
Elegant Repurposing of an Existing Technique: The core insight—to use code deformation for hardware maintenance—is both clever and powerful. Code deformation is a known primitive in the surface code literature, primarily for implementing logical gates and moving logical qubits (e.g., lattice surgery). The authors have recognized that this same tool for manipulating logical information can be masterfully repurposed for managing the health of the physical substrate. This is a beautiful example of cross-pollination, applying a concept from the logical layer to solve a problem at the physical layer, thereby creating a tightly integrated, cross-layer solution.
-
Holistic and Complete Framework: This is not merely a theoretical proposal; it is a well-thought-out systems paper. The CaliQEC framework is comprehensive, encompassing the entire lifecycle of the problem:
- Characterization (Section 4): A practical method for measuring the key physical parameters (drift rate, calibration time, crosstalk) needed to inform the strategy.
- Compilation (Section 5): A sophisticated, two-stage scheduling algorithm that intelligently groups calibration tasks and schedules them to balance parallelism against resource overhead.
- Runtime (Section 6): The formalization of dedicated instruction sets for both square and heavy-hexagon lattices, demonstrating a clear path to implementation on real-world hardware topologies.
-
Strong and Persuasive Evaluation: The experimental design is excellent. The choice of baselines is perfect for highlighting the contribution: "No Calibration" demonstrates the necessity of a solution, while "Logical Swap for Calibration" (LSC) represents a plausible but naive alternative that effectively underscores the efficiency and fine-grained nature of CaliQEC. The results presented in Table 2 (page 11) are compelling, showing a dramatic reduction in qubit overhead (e.g., from 363% for LSC to ~24% for CaliQEC) while successfully managing the retry risk. The inclusion of small-scale experiments on real Rigetti and IBM hardware (Section 8.3, page 12) provides a crucial proof-of-concept, grounding the simulation results in physical reality.
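As a rough sanity check on these overhead figures (layout, code distance, and strip width below are assumed; this is not the paper's exact accounting), the asymmetry between the two approaches follows from simple geometry.

```python
# Rough arithmetic, illustrative only: relocating a whole patch roughly
# quadruples its footprint, whereas isolating a few qubits only needs a thin
# strip of compensation qubits along one boundary of a distance-d patch.

d = 25
patch = (2 * d - 1) ** 2           # data + ancilla qubits in an unrotated distance-d patch
lsc_extra = 3 * patch              # naive 2D expansion: ~4x total qubits
strip_extra = (2 * d - 1) * 2 * 4  # a few compensation columns (assumed width 4)

print(100 * lsc_extra / patch)     # 300.0 -- percent extra qubits for the swap baseline
print(100 * strip_extra / patch)   # ~16.3 -- percent extra for boundary compensation
```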
Weaknesses
While this is an excellent paper, its primary weaknesses lie in the assumptions it makes about the interaction between its proposed system and the broader (and still developing) FTQC software/hardware stack.
-
The "Cost" of Deformation: The paper frames the overhead of CaliQEC primarily in terms of additional compensation qubits and scheduling complexity. However, the code deformation instructions themselves consist of sequences of physical measurements and gate operations. These operations are not error-free. There is a potential risk that the errors introduced by the deformation process itself could, in some regimes, negate the benefits of calibrating a drifting qubit. The analysis does not seem to fully account for the error burden of the deformation/reintegration process itself.
-
Scheduler Integration and Contention: The paper presents the calibration scheduling problem (Section 5, page 5-7) in isolation. In a real FTQC system, this scheduler would not operate in a vacuum. A separate compiler module would be scheduling logical operations, some of which (like T-gates via magic state distillation or logical CNOTs via lattice surgery) also require significant space-time resources and potentially use the same code deformation primitives. The paper does not discuss how resource contention between the "maintenance scheduler" (CaliQEC) and the "computation scheduler" would be resolved. This is a critical next step for integrating such a system into a full FTQC architecture.
-
Assumptions about Crosstalk Characterization: The method for characterizing crosstalk (Section 4, page 5) involves isolating qubits identified as neighbors. This model seems to primarily capture local crosstalk. However, longer-range crosstalk effects (e.g., frequency crowding, control line coupling) are known to exist in large quantum processors. The current model might underestimate the size of the "isolation zone" needed in a dense, large-scale system, which could impact the overhead calculations.
Questions to Address In Rebuttal
-
Regarding the code deformation instructions (Section 6, page 7-8), could the authors comment on the potential for the deformation/reintegration process to introduce errors? Have they analyzed the trade-off where the error incurred by the act of isolating a qubit might be greater than the error prevented by calibrating it?
-
The intra-group scheduling algorithm (Section 5.3, page 7) is a key component for efficiency. Could you elaborate on how this calibration scheduling would be integrated with a parallel scheduler for logical operations? For instance, how would the system prioritize between performing a logical CNOT via lattice surgery and isolating a patch for calibration if both require modifying the same region of the code?
-
In your evaluation (Section 8, page 10, QECali paragraph), the maximum tolerable distance loss is set to Δd = 4. Could you provide more intuition on how this parameter was chosen? How sensitive are the overall performance results (particularly qubit overhead and retry risk) to the choice of Δd? Is there a systematic way to determine the optimal Δd for a given hardware architecture and application?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present CaliQEC, a framework for performing in-situ physical calibration of qubits within a live surface code computation. The central problem addressed is error drift, where qubit and gate performance degrades over time, eventually compromising the effectiveness of quantum error correction (QEC). The proposed solution leverages the theory of code deformation—a known technique for modifying the structure of a surface code patch—to temporarily isolate drifting physical qubits, perform a full calibration, and then reintegrate them into the code, all while the logical computation proceeds on the deformed patch. The framework also includes a device characterization stage and an adaptive scheduling algorithm to manage this process efficiently.
My analysis concludes that while the foundational theoretical tool (code deformation) is not new, its specific application and the comprehensive engineering framework built around it for the purpose of dynamic, in-situ physical calibration represents a novel and significant contribution. The key novelty lies in the shift from using deformation to handle static, permanent defects to managing dynamic, temporary states of qubit unavailability for maintenance.
Strengths
The primary strength of this paper, from a novelty perspective, is the creative repurposing and significant extension of an existing theoretical tool to solve a different, and highly practical, problem.
-
Novel Application of Code Deformation: The theory of code deformation [10, 67] is well-established, primarily for implementing logical gates (lattice surgery) or for handling static, defective qubits [53, 64]. This paper's core conceptual leap is to treat a qubit undergoing calibration as a temporary, scheduled defect. This reframing is non-trivial and allows the entire machinery of defect tolerance to be applied to the problem of runtime maintenance. This appears to be the first formal proposal and evaluation of such a strategy.
-
New Instruction Set for Heavy-Hexagon Topology: The authors acknowledge that the instructions for square lattices are adapted from prior work [70]. However, the design and formalization of a new code deformation instruction set specifically for the heavy-hexagon topology (Section 6.1, Page 7) is a concrete and novel contribution. This is particularly relevant given that this topology is used in state-of-the-art hardware (e.g., IBM devices) and presents non-trivial structural differences from a simple square lattice, such as shared ancilla qubits and varying qubit connectivity (Figure 8, Page 8).
-
Synthesis into a Complete Framework: The novelty is not just in a single idea but in the construction of a full-stack solution. The combination of (a) device characterization to model drift (Section 4, Page 5), (b) an adaptive scheduling algorithm to manage calibration overhead (Section 5, Page 5), and (c) a runtime deformation mechanism constitutes a complete, novel framework that did not exist before.
Weaknesses
My concerns are not with the validity of the work, but with ensuring the "delta" over prior art is precisely and defensibly articulated.
-
Overlap with Prior "In-situ Calibration" Concepts: The term "in-situ calibration" is not entirely new in this context. The work of Kelly et al. [34], "Scalable in situ qubit calibration during repetitive error detection," presents a method for concurrent calibration. The authors of the current paper do cite this work and differentiate their approach by stating that [34] relies on "speculative estimation of control parameters rather than physical calibration" (Section 2, Page 2). While this distinction is crucial, the novelty rests heavily on the argument that such estimation is insufficient for fault-tolerant QEC and that full, disruptive physical calibration (requiring isolation) is necessary. The paper’s novelty claim would be weakened if the methods in [34] could be extended to achieve the required fidelity.
-
Adaptation of Existing Scheduling Heuristics: The proposed scheduling algorithm (Section 5.3, Page 7) is a greedy heuristic designed to balance parallelism and resource overhead. While its application to scheduling code deformations is novel, the underlying principles (e.g., sorting by a priority metric, iteratively building non-conflicting batches) are common in classical scheduling problems. The paper should be careful not to overstate the algorithmic novelty of the scheduler itself, but rather focus on the novelty of the scheduling problem and the custom cost model (Cost = Δd * Σ t_cali[g]); a minimal sketch of this kind of greedy, cost-bounded batching is given below.
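To make the comparison with classical scheduling heuristics concrete, the sketch below shows the kind of greedy, cost-bounded batching described above. It is an illustration only: the data structure, the drift-priority field, and the conflict test are assumptions rather than CaliQEC's implementation, and only the cost form Cost = Δd * Σ t_cali[g] and the Δd bound are taken from the paper as characterized in this review.

```python
# Hedged sketch of a greedy, cost-bounded batch scheduler in the spirit of the
# heuristic discussed above. All data structures and thresholds are illustrative
# assumptions; only Cost = Δd * Σ t_cali[g] and the Δd bound follow the paper.
from dataclasses import dataclass, field

@dataclass
class CalibRequest:
    qubit: int              # physical qubit to isolate
    t_cali: float           # estimated calibration time (assumed known from characterization)
    delta_d: int            # code-distance loss if this qubit is deformed out
    priority: float         # drift-urgency metric (assumption: higher = calibrate sooner)
    conflicts: set = field(default_factory=set)  # qubits whose deformation regions overlap

def schedule_batches(requests, delta_d_max=4):
    """Greedily pack non-conflicting calibration requests into batches,
    keeping each batch's total distance loss within delta_d_max."""
    pending = sorted(requests, key=lambda r: r.priority, reverse=True)
    batches = []
    while pending:
        batch, used_qubits, batch_dd = [], set(), 0
        for req in list(pending):
            region = req.conflicts | {req.qubit}
            if used_qubits.isdisjoint(region) and batch_dd + req.delta_d <= delta_d_max:
                batch.append(req)
                used_qubits |= region
                batch_dd += req.delta_d
                pending.remove(req)
        if not batch:   # cannot satisfy the remaining requests within Δd_max; stop (illustrative)
            break
        cost = batch_dd * sum(r.t_cali for r in batch)  # Cost = Δd * Σ t_cali[g]
        batches.append((batch, cost))
    return batches

# Usage example with made-up numbers:
reqs = [CalibRequest(0, 2.0, 1, 0.9, {1}), CalibRequest(5, 1.5, 2, 0.7, {4, 6}),
        CalibRequest(9, 3.0, 2, 0.4, {8})]
for i, (batch, cost) in enumerate(schedule_batches(reqs)):
    print(f"batch {i}: qubits={[r.qubit for r in batch]}, cost={cost:.1f}")
```

The point is that the algorithmic skeleton is standard; the defensible novelty lies in the deformation-aware cost model and the conflict/Δd constraints it enforces.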
Questions to Address In Rebuttal
-
Regarding Kelly et al. [34]: The authors correctly differentiate their work from [34] on the grounds that it performs full physical calibration rather than parameter estimation. Could the authors elaborate further on the practical limitations of the approach in [34] that render it insufficient for the long-running, fault-tolerant applications targeted in this paper? Specifically, are there classes of drift (e.g., frequency shifts vs. amplitude errors) that estimation-based techniques fundamentally cannot correct, thus necessitating the physical isolation proposed in CaliQEC?
-
Novelty of the Scheduling Heuristic: The scheduling algorithm in Section 5 is presented as a heuristic to solve the complex trade-off between calibration time and code distance loss. While its application is novel, are the core algorithmic principles themselves adaptations of known scheduling heuristics from other domains? Clarifying this would help isolate the precise novel contribution in the scheduling component.
-
Generalizability of the Design Principles: The development of a deformation instruction set for the heavy-hexagon topology is a clear strength. How general is the design principle described as "leveraging residual connectivity" (Section 6.1, Page 8)? Could this principle be systematically applied to generate instruction sets for other complex QEC code geometries, or does each new topology require a completely new, bespoke design effort from first principles?
Variational Quantum Algorithms in the era of Early Fault Tolerance
Abstract
Quantum computing roadmaps predict the availability of 10,000-qubit devices within the next 3–5 years. With projected two-qubit error rates of 0.1%, these systems will enable certain operations under quantum error correction (QEC) using lightweight codes, ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose the use of partial quantum error correction (pQEC) for executing Variational Quantum Algorithms (VQAs) in the so-called "Early Fault Tolerance" (EFT) era. The core idea is to apply error correction only to the Clifford components of the circuit, while implementing the non-Clifford Rz(θ) rotations via a noisy magic state injection process. The paper claims this approach, termed EFT-VQA, yields significant fidelity improvements (up to 9.27x on average) over a "standard" approach using full QEC with T-state distillation (termed "qec-conventional"). The authors also introduce architectural optimizations, including a specialized qubit layout, a "patch shuffling" mechanism for state injection, and a "layout-aware" ansatz, which are claimed to reduce latency and improve resource efficiency.
While the premise of exploring intermediate error correction strategies is valid, the paper's central claims of large-scale VQA advantage are predicated on a critical methodological simplification—the use of Clifford state simulations for all systems larger than 12 qubits. This approach fundamentally misrepresents the nature of VQAs by discretizing the variational parameters, calling into question the validity and relevance of the main results presented in Figure 12.
Strengths
-
Problem Motivation: The paper correctly identifies a critical gap between the capabilities of current NISQ hardware and the resource requirements of full Fault-Tolerant Quantum Computing (FTQC). The definition and exploration of an intermediate "EFT" regime is a pragmatic and relevant direction for research.
-
pQEC as a Strategy: The choice to investigate pQEC is logical. The high overhead of T-gate synthesis and distillation is a well-known bottleneck, and exploring alternatives like magic state injection for continuous Rz rotations is a sensible trade-off in a resource-constrained environment.
-
Microarchitectural Ideas: The proposed "patch shuffling" mechanism (Section 4.2, Appendix A) for managing probabilistic state injection is a concrete and technically sound contribution. It demonstrates a detailed consideration of the operational costs at the error-corrected circuit level.
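To make the operational cost of probabilistic injection concrete, the following hedged Monte-Carlo sketch compares on-demand retry (each failed injection pays the preparation latency serially) against a shuffled scheme in which compensatory states (e.g., 2θ, 4θ) are pre-injected into a spare patch, as the paper's Section 4.2 describes. The 50% success probability, the latency constants, and the retry depth are assumptions for illustration, not the paper's measured parameters.

```python
# Hedged illustration of why pre-injected compensatory magic states can hide
# the latency of probabilistic Rz(θ) injection. Success probability (0.5),
# latency constants, and retry depth are assumptions, not paper parameters.
import random

T_INJECT = 10    # assumed cycles to inject a magic state into a patch
T_APPLY  = 2     # assumed cycles to consume a ready state (lattice-surgery step)

def rz_on_demand(max_tries=3):
    """Each failed attempt prepares the doubled-angle state serially."""
    cycles = 0
    for _ in range(max_tries):
        cycles += T_INJECT + T_APPLY          # prepare θ (then 2θ, 4θ, ...) and apply
        if random.random() < 0.5:             # injection succeeds w.p. 1/2 (assumed)
            return cycles
    return cycles                             # fall back after max_tries (illustrative)

def rz_shuffled(max_tries=3):
    """θ, 2θ, 4θ states are pre-injected into spare patches, so retries only
    pay the apply/shuffle cost; preparation is off the critical path."""
    cycles = T_INJECT                         # first preparation is exposed once
    for _ in range(max_tries):
        cycles += T_APPLY                     # consume the next pre-prepared patch
        if random.random() < 0.5:
            return cycles
    return cycles

random.seed(0)
n = 100_000
print("on-demand avg cycles:", sum(rz_on_demand() for _ in range(n)) / n)
print("shuffled  avg cycles:", sum(rz_shuffled() for _ in range(n)) / n)
```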
Weaknesses
-
Fundamentally Flawed Large-Scale Simulation Methodology: The paper's primary weakness, which undermines its most significant claims, is the use of Clifford state simulations for circuits with 16 to 100 qubits (Section 5.2.2). The authors state they "constrain the Rz(θ) arguments to multiples of π/2." This is not a VQA. VQAs are defined by their use of continuous variational parameters to explore a solution space. By discretizing the angles, the authors are solving a different, combinatorially constrained optimization problem. The justification that this is a "proxy" used in other works [43, 68, 77] is insufficient; those works often use it to benchmark hardware fidelity or error mitigation, not to make claims about the performance of a variational algorithm whose very power lies in its continuous nature. Consequently, the results in Figure 12 do not represent the performance of VQAs at scale and cannot be used to support the paper's central conclusion.
-
Inadequate Reference Energy for Scaled Benchmarks: For systems larger than 12 qubits, the reference ground state energy E0 used in the YA/B metric (Section 5.3) is defined as "the lowest stabilizer state energy obtained in the absence of noise." This is a profoundly limiting choice. It means the "improvement" is measured relative to the best possible energy achievable within a classically simulable subspace, not the true ground state of the Hamiltonian. The gap between the true ground state and the best stabilizer state can be substantial. Therefore, the reported YA/B values do not quantify how much closer the algorithm gets to the correct physical solution, but rather how well it performs within a highly restricted classical optimization.
-
Potentially Unfair Baseline Comparison: The "qec-conventional" baseline appears to be a strawman. The authors' argument against it hinges on two points: (1) low-distance codes yield insufficient T-gate fidelity, and (2) high-distance codes incur prohibitive memory errors due to long distillation times. While plausible, this analysis may not represent an optimized baseline. The modeling of memory errors could be overly pessimistic, and the paper does not consider more advanced compilation techniques or co-design of distillation factories and algorithms that could mitigate these stalling overheads. The conclusion that pQEC is superior rests on this specific, potentially uncharitable, model of the Clifford+T paradigm.
-
Limited Generality of Architectural Claims: The proposed "blocked_all_to_all" ansatz (Section 4.3) is designed explicitly to perform well on the proposed layout by minimizing costly, long-range CNOTs. While it is shown to execute faster (Table 2), its performance relative to the FCHE ansatz is inconsistent (Figure 14). The analysis lacks a rigorous study of expressibility. It is entirely possible that this faster ansatz is simply less capable of representing the ground state for certain problems (e.g., Heisenberg J=1), in which case the speedup is irrelevant. The claims of architectural efficiency are therefore tied to a specific, non-universal ansatz.
Questions to Address In Rebuttal
-
Regarding the Simulation Methodology: Can the authors provide evidence, perhaps from the 8 and 12 qubit density matrix simulations, that the optimal energy found using discretized Rz angles (Clifford simulation) is a reliable predictor of the optimal energy found using continuous angles (true VQA)? If not, how can the large-scale results in Figure 12 be considered representative of VQA performance?
-
Regarding the Reference Energy: For the Hamiltonians studied, what is the known gap between the true ground state energy and the lowest-energy stabilizer state for systems of 16+ qubits? Please quantify how this gap impacts the interpretation of the YA/B metric and the claim of outperforming NISQ.
-
Regarding the "qec-conventional" Baseline: The argument against large factories relies on memory error accumulation during stalls. Did the authors consider a baseline with multiple, parallelized distillation factories, which is a standard approach to increasing the T-state production rate and reducing stalls? How would this affect the breakeven point with pQEC?
-
Regarding the Layout-Aware Ansatz: The blocked_all_to_all ansatz is faster but sometimes performs worse. This suggests a trade-off between latency and expressibility. Could the authors provide a more rigorous analysis of this ansatz's expressibility (e.g., using techniques from [80]) to justify that the performance degradation seen in Figure 14 is not simply a result of the ansatz being unable to represent the solution?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents a compelling systems-level investigation into the execution of Variational Quantum Algorithms (VQAs) within the "Early Fault Tolerance" (EFT) era, a regime the authors define as possessing ~10,000 physical qubits with error rates around 0.1%. The core contribution is the application and adaptation of partial quantum error correction (pQEC) as a pragmatic middle ground between the prohibitively noisy NISQ paradigm and the resource-intensive full Fault-Tolerant Quantum Computing (FTQC) paradigm.
The authors propose to error-correct the Clifford gate "scaffolding" of VQA circuits using lightweight surface codes, while implementing the non-Clifford parametric Rz rotations via a more efficient, albeit noisier, magic state injection technique. This avoids the massive overhead of T-gate decomposition and distillation. Through simulation and analysis, they demonstrate that this "EFT-VQA" approach can yield significant fidelity improvements (up to 9.27x on average) over standard NISQ execution and also outperform a resource-constrained full QEC approach. The work is further strengthened by co-designing architectural and algorithmic optimizations, such as efficient qubit layouts, a "patch shuffling" mechanism for state injection, and a new layout-aware VQA ansatz (blocked_all_to_all), all tailored to the EFT context.
Strengths
-
Excellent Strategic Positioning and Timeliness: The paper's greatest strength is its focus on the critical, yet underexplored, gap between NISQ and FTQC. While much of the community remains polarized, this work provides a well-reasoned, quantitative roadmap for achieving utility in the intermediate term. By defining and targeting the "EFT" regime, the authors frame a problem that is of immediate relevance to hardware developers and algorithm designers over the next 5-10 years.
-
A Compelling and Pragmatic Core Idea: The application of pQEC to VQAs is an elegant synthesis of ideas. It recognizes the fundamental structure of VQA circuits—a backbone of entangling Clifford gates and a set of tunable, non-Clifford rotation gates. The insight to protect the structural integrity of the circuit with QEC while using a "good enough" method for the parametric parts is highly intuitive and powerful. This is a practical compromise that leverages the strengths of QEC where it matters most without paying the full, exorbitant cost of universal fault tolerance. The analysis in Section 3.2 (page 5), which thoughtfully weighs the trade-offs between distillation factory size, T-gate error, and memory error, is particularly insightful and highlights the subtle reasons why full FTQC is ill-suited for this regime.
-
Holistic Systems-Level Approach: This is not merely an algorithmic proposal; it is a co-designed hardware/software systems paper. The authors go beyond the core pQEC idea to consider its architectural implications in Section 4 (pages 6-9). The proposed qubit layout (Figure 3), the clever "patch shuffling" scheme to mitigate injection latency (Section 4.2), and the design of a new, latency-aware ansatz (Section 4.3) demonstrate a deep understanding of the practical challenges. This architectural perspective significantly elevates the work's potential impact.
-
Strong Connection to and Advancement of Prior Work: The authors correctly identify and build upon foundational ideas like magic state injection for arbitrary rotations from Lao et al. [46] and the concept of partial error correction from Akahoshi et al. [7]. Their contribution is not in inventing this technique, but in being the first, to my knowledge, to comprehensively analyze, adapt, and optimize it for the VQA application domain. This contextualizes their work properly and showcases its value as a crucial next step in a promising research direction.
Weaknesses
While the core vision is strong, the paper could be strengthened by broadening its context and a more rigorous definition of its baseline comparisons.
-
Ambiguity of the NISQ Baseline: The primary comparison is made against a "NISQ" baseline. However, the NISQ paradigm is not monolithic; it is characterized by a rich ecosystem of advanced error mitigation (EM) techniques (ZNE, PEC, DD, etc.). The paper shows a brief but welcome integration of one such technique, VarSaw, in Figure 15 (page 13), but the main results in Figures 12 and 13 appear to compare against an unmitigated or lightly mitigated NISQ execution. The crucial question for practitioners is not "Is pQEC better than noisy execution?" but "Is the overhead of pQEC a better investment than applying a suite of sophisticated, zero-overhead EM techniques?" A more direct comparison against a strong, multi-technique EM baseline would make the case for pQEC's superiority irrefutable.
-
Generalizability of the Rz/CNOT Ratio Assumption: The theoretical argument for pQEC's advantage, laid out in Section 4.4 (page 9), hinges critically on the ratio of CNOTs to Rz gates in the ansatz. The authors show this holds for their tested benchmarks, but VQA is a broad field. For instance, some Quantum Machine Learning ansatze might have different structural properties. The paper would be more robust if it acknowledged this dependency more explicitly and perhaps discussed which classes of VQAs are most (or least) likely to benefit from this approach based on their gate composition. (A hedged numerical sketch of the crossover argument follows this list.)
-
Underdefined Boundaries of the "EFT" Regime: The EFT era is defined primarily by qubit count and physical error rate. However, other architectural parameters are equally critical, such as code cycle time, measurement and reset fidelity/latency, and qubit connectivity. These factors directly impact the severity of memory errors—a key factor in the authors' analysis—and the viability of their proposed layout. A more detailed specification of the assumed underlying hardware model would help ground the results and clarify the domain of their applicability.
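As flagged in the Rz/CNOT-ratio weakness above, the advantage of pQEC hinges on a crossover condition. The sketch below illustrates only the structure of that argument under a first-order error-accumulation model with assumed per-gate error rates; the specific rates are placeholders and do not reproduce the ~0.76 threshold the paper reports.

```python
# Hedged sketch of the crossover argument from the paper's Section 4.4.
# All error rates are assumed placeholders; the paper's own analysis reports
# a threshold of ~0.76, which these numbers are not intended to reproduce.

e_cnot_nisq = 1.0e-3   # assumed physical two-qubit error (NISQ)
e_rz_nisq   = 1.0e-4   # assumed physical single-qubit rotation error (NISQ)
e_cnot_eft  = 1.0e-5   # assumed logical CNOT error under a lightweight code
e_rz_inject = 7.0e-4   # assumed error of an undistilled injected Rz(θ)

def total_error(n_cnot, n_rz, e_cnot, e_rz):
    # First-order (union-bound style) accumulation; an assumption, not the paper's model.
    return n_cnot * e_cnot + n_rz * e_rz

# pQEC beats NISQ when its per-circuit error is lower, i.e. when
# n_cnot / n_rz > (e_rz_inject - e_rz_nisq) / (e_cnot_nisq - e_cnot_eft).
ratio_threshold = (e_rz_inject - e_rz_nisq) / (e_cnot_nisq - e_cnot_eft)
print(f"breakeven CNOT/Rz ratio under these assumptions: {ratio_threshold:.2f}")

n_rz = 100
for n_cnot in (40, 80, 120):
    nisq = total_error(n_cnot, n_rz, e_cnot_nisq, e_rz_nisq)
    eft  = total_error(n_cnot, n_rz, e_cnot_eft, e_rz_inject)
    print(f"n_cnot={n_cnot}: NISQ err~{nisq:.3e}, EFT-VQA err~{eft:.3e}, pQEC wins: {eft < nisq}")
```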
Questions to Address In Rebuttal
-
Could the authors clarify the precise error mitigation techniques, if any, applied to the "NISQ" baseline in the main comparison figures (e.g., Figure 12)? Can you elaborate on why you believe the significant gains from pQEC would persist even against a state-of-the-art NISQ baseline employing a combination of techniques like zero-noise extrapolation and probabilistic error cancellation?
-
The argument in Section 4.4 relies on the CNOT count growing faster than 0.76 times the Rz count. While this holds for hardware-efficient ansatze, can you comment on the applicability of your pQEC framework to other important VQA families, such as chemistry-inspired ansatze (e.g., UCCSD) or QAOA, and how their distinct structures might affect the performance crossover point between NISQ and EFT?
-
Beyond qubit count and gate fidelity, what are the most critical architectural assumptions (e.g., code cycle time, connectivity) that underpin your analysis? How sensitive are your conclusions, particularly regarding the outperformance of pQEC over "qec-conventional" (Figure 4), to these underlying hardware parameters?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper investigates the use of Variational Quantum Algorithms (VQAs) in the projected "Early Fault Tolerance" (EFT) era, characterized by ~10,000 physical qubits and physical error rates of ~0.1%. The central technical proposal is to adapt the concept of partial quantum error correction (pQEC) for VQAs. In this scheme, Clifford gates are error-corrected using surface codes, while non-Clifford Rz(θ) rotations are implemented via un-distilled, higher-error magic state injection. The authors evaluate this "EFT-VQA" approach against conventional NISQ execution and a more traditional distillation-based fault-tolerant approach ("qec-conventional"). They supplement this core idea with several architectural co-design proposals, including an efficient qubit layout, a "patch shuffling" mechanism to reduce latency from probabilistic state injection, and a layout-aware VQA ansatz.
Strengths
The primary strength of this work lies not in the invention of a single new primitive, but in the novel synthesis and detailed, application-specific optimization of existing concepts for a well-defined and important future computing regime.
-
Novel Application Domain: While the concept of pQEC is not new, its specific and thorough application to VQAs is a novel investigation. The authors provide a compelling argument for why VQAs, with their characteristic structure dominated by Rz(θ) rotations, are particularly well-suited for a pQEC approach. This targeted application is a new direction.
-
Specific Architectural Mechanisms: The paper proposes several concrete mechanisms that appear novel in this context:
- The "patch shuffling" technique (Section 4.2, page 7) is a clever microarchitectural optimization to pipeline the probabilistic Rz(θ) state injection process. This method of pre-injecting compensatory magic states (e.g., 2θ, 4θ) into a secondary patch to hide latency is a specific, novel solution to a key performance bottleneck of this approach.
- The "blocked_all_to_all" ansatz (Section 4.3, page 9, Figure 10) is a novel contribution. It represents a form of hardware-software co-design, where the algorithm's structure is explicitly tailored to the latency characteristics of CNOT operations on their proposed physical layout (analyzed in Figure 9). This is a more sophisticated approach than simply using standard hardware-efficient ansatze.
-
Novel Analytical Framework: The analysis in Section 4.4, which derives the CNOT-to-Rz(θ) gate ratio required for pQEC to outperform NISQ, provides a new and useful heuristic for designing ansatze specifically for the EFT era. This quantitative framework for evaluating algorithmic choices based on the underlying error correction strategy is a valuable and novel contribution.
Weaknesses
The paper's claims of novelty must be carefully qualified. The foundational concepts upon which the entire work is built are drawn from prior art.
-
Core Concept is Not Original: The central idea of partial QEC—error-correcting Clifford gates while using a different, less-protected mechanism for non-Clifford gates—is not new. The authors themselves appropriately cite Akahoshi et al. [7], who introduced a "partially fault-tolerant quantum computing architecture with error-corrected Clifford gates and space-time efficient analog rotations." The use of magic state injection for arbitrary Rz(θ) rotations is also based on prior work, namely Lao et al. [46]. The novelty of the present paper is therefore limited to the adaptation of this pre-existing strategy to VQAs, not its invention.
-
Incremental Nature of Layout Optimization: The proposed qubit layout (Section 4.1, Figure 3) is an efficient arrangement, but the principles of designing layouts to balance connectivity and qubit density for surface codes are well-established. While the layout is well-motivated for VQA-style circuits, it represents an incremental engineering optimization over existing layout design principles rather than a fundamentally new architectural paradigm.
-
Complexity vs. Benefit: The proposed blocked_all_to_all ansatz is more complex and less general than a standard fully-connected hardware-efficient ansatz. The results in Figure 14 show that while it often performs better, there are cases (e.g., Heisenberg J=1) where it performs significantly worse. This suggests that the novel ansatz, while beneficial, introduces a new layer of problem-specific tuning, and its advantage is not universal. The trade-off between its specialized performance and its reduced expressibility compared to a fully-connected ansatz needs clearer justification.
Questions to Address In Rebuttal
-
The paper's contribution hinges on the "adaptation" of pQEC from Akahoshi et al. [7] to VQAs. Beyond the change in application focus from Trotterized simulation to VQAs, what were the non-trivial technical challenges in this adaptation that represent a novel contribution? Please clarify the specific conceptual "delta" between the general architecture in [7] and the one proposed here.
-
The "patch shuffling" idea is presented as a novel mechanism. Can the authors comment on any related prior art in the domain of pipelining or speculative execution for probabilistic processes, either in classical or quantum architecture, that may be conceptually similar?
-
The proposed blocked_all_to_all ansatz is designed to minimize the use of high-latency "linking CNOTs." Does this design choice risk creating an ansatz with insufficient entanglement capability for Hamiltonians that require long-range correlations between the "blocks"? How can one know a priori if this novel ansatz is suitable for a given problem?
SwitchQNet: Optimizing Distributed Quantum Computing for Quantum Data Centers with Switch Networks
Abstract
Distributed Quantum Computing (DQC) provides a scalable architecture by interconnecting multiple quantum processor units (QPUs). Among various DQC implementations, quantum data centers (QDCs) — where QPUs in different racks are connected through ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present SwitchQNet, a compiler designed to optimize communication scheduling for Distributed Quantum Computers (DQCs) in a data center (QDC) architecture that utilizes reconfigurable optical switches. The primary contribution is a co-optimization strategy across the program and network layers that employs look-ahead scheduling to batch in-rack communications and parallelize cross-rack communications through a "splitting" technique. The authors claim an 8.02x reduction in communication latency compared to a baseline approach, with what they characterize as "small overhead." While the paper addresses a relevant problem, the claims of significant improvement appear to rest on a potentially weak baseline and a set of optimistic assumptions, and the proposed methodology contains mechanisms that suggest a lack of guaranteed performance.
Strengths
-
Problem Formulation: The paper correctly identifies a critical bottleneck in the targeted QDC architecture [59]: the high latency of cross-rack communication and switch reconfiguration relative to in-rack communication. The motivational analysis in Figure 2 (Page 2) provides a clear and compelling quantitative justification for focusing on this problem.
-
Core Mechanism: The central idea of splitting a congested cross-rack EPR pair generation into a parallel cross-rack generation and a subsequent in-rack generation (via entanglement swapping) is a technically plausible strategy for latency hiding.
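A back-of-envelope sketch of the latency argument behind the split is given below. The latency constants and the serialized-backlog model of congestion are assumptions for illustration, not parameters from the paper; only the structure of the split (a cross-rack pair generated from a less busy QPU in the same rack, merged with an in-rack pair via entanglement swapping, then distilled) follows the mechanism described in this review.

```python
# Hedged latency sketch of the cross-rack "split". All constants are assumed
# placeholders; only the structure (parallel cross-rack A'-B plus in-rack A-A',
# merged by entanglement swapping) follows the mechanism described above.

T_CROSS   = 1000   # assumed cross-rack EPR generation latency (us)
T_INRACK  = 50     # assumed in-rack EPR generation latency (us)
T_SWAP    = 10     # assumed entanglement-swapping (local ops + classical msg) latency (us)
T_DISTILL = 100    # assumed distillation latency to recover fidelity after the split (us)

def no_split(queue_ahead):
    # A's cross-rack port is congested: our request waits behind `queue_ahead`
    # earlier cross-rack generations, then runs its own.
    return (queue_ahead + 1) * T_CROSS

def with_split(queue_ahead_at_A_prime):
    # Offload the long cross-rack generation to an idle QPU A' in the same rack,
    # generate the A-A' in-rack pair in parallel, then swap and distill.
    cross = (queue_ahead_at_A_prime + 1) * T_CROSS
    return max(cross, T_INRACK) + T_SWAP + T_DISTILL

for backlog in (0, 1, 3):
    print(f"backlog at A = {backlog}: "
          f"no split = {no_split(backlog)} us, split via idle A' = {with_split(0)} us")
```

Note that with no backlog the split is strictly worse (extra swap and distillation), which is consistent with treating it as a congestion-relief maneuver rather than a default policy.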
Weaknesses
-
Baseline Insufficiency: The entire 8.02x performance claim is predicated on outperforming a baseline described as a "buffer-assisted on-demand EPR generation" (Section 5.1, Page 9). This appears to be a strawman. An "on-demand" strategy, which waits until a resource is needed to schedule its creation, is the most naive approach possible. A more reasonable baseline would incorporate at least a simple pre-fetching or look-ahead scheduling heuristic, which would almost certainly narrow the performance gap significantly. Without a comparison to a more sophisticated, state-of-the-art scheduling algorithm, the claimed improvement is unsubstantiated.
-
The "Auto Retry" Mechanism Reveals a Fundamental Flaw: The inclusion of an "auto retry" mechanism (Section 4.5, Page 8) is a major red flag. The authors state it is for "rare cases," but its existence implies that the primary scheduling algorithm (with its "soft conditions" and "projected buffer" heuristics) is not guaranteed to produce a valid, deadlock-free schedule. The paper fails to provide any data on how frequently this mechanism is triggered. The "retry overhead" metric is insufficient, as it only measures the number of explored time steps, not the actual computational cost (i.e., wall-clock time) of the compiler having to backtrack and re-compute schedules. This suggests the compiler itself could be impractical for complex programs where these "rare cases" become common.
-
Overstated and Parameter-Specific Claims: The headline "8.02x" figure is an average that obscures significant variance and is highly dependent on the chosen hyper-parameters. For example, the authors' own QEC integration experiment (Section 5.5, Page 13) shows the improvement factor dropping to an average of 4.89x. This suggests that for structured, fault-tolerant algorithms—the primary use case for large-scale DQC—the optimization is far less effective. The performance is also shown to be sensitive to look-ahead depth and buffer size (Figure 8, Page 11), yet the paper presents the 8.02x figure as a general result.
-
Unjustified Notion of "Small Overhead": The authors claim the overhead is small, but the data in Table 2 (Page 10) can be interpreted otherwise. The average wait time for an EPR pair in the buffer increases by a factor of 6.51x. Given a reconfiguration latency of 1 ms, this translates to an average wait of several milliseconds. The paper completely fails to analyze the impact this has on qubit fidelity due to decoherence. For systems with T2 times in the tens of milliseconds, this is a non-trivial source of error that is not captured by their "EPR overhead" metric, which only accounts for the generation of additional pairs. A "small" latency overhead can easily become a "large" fidelity penalty. (A back-of-envelope estimate of this decoherence penalty follows this list.)
-
Heuristics and Conditions Lack Rigor: The scheduling framework relies on a set of "Hard" and "Soft" conditions (Section 4.2, Page 6). The soft condition, in particular, uses an adjustable threshold and a heuristic projected_buffer calculation. This lacks formal guarantees. The deadlock prevention mechanism for multiple splits (reserving buffer space, Section 4.3, Page 7) feels like an ad-hoc patch that could excessively constrain the scheduler, potentially negating the benefits of splitting by creating a new buffer bottleneck.
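To quantify the decoherence concern raised in the overhead weakness above (see the note there), a back-of-envelope estimate follows. The exponential dephasing model and the T2 values are assumptions chosen to span the range quoted in the questions below; the ~6.5 ms wait is the figure this review derives from Table 2.

```python
# Back-of-envelope decoherence estimate for EPR pairs idling in a buffer.
# The exp(-t/T2) dephasing model and T2 values are assumptions for illustration;
# the ~6.5 ms wait is derived in this review from Table 2 of the paper.
import math

wait_ms = 6.5
for t2_ms in (10, 30, 100):
    # Crude single-qubit coherence survival; a stored EPR pair has two idling
    # halves, so square the factor as a pessimistic bound (assumption).
    per_qubit = math.exp(-wait_ms / t2_ms)
    pair = per_qubit ** 2
    print(f"T2 = {t2_ms:3d} ms: per-qubit factor ~ {per_qubit:.3f}, EPR-pair bound ~ {pair:.3f}")
```

Under these assumptions the buffered pair retains only roughly 27% of its coherence at T2 = 10 ms, which is exactly the regime where the "small overhead" framing breaks down.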
Questions to Address In Rebuttal
-
Please justify the choice of an "on-demand" baseline. Can you provide a comparison against a more competitive baseline, such as one using a simple fixed-window look-ahead for pre-fetching without the communication splitting optimization?
-
What is the frequency of the "Auto Retry" mechanism being triggered across your benchmark suite? Please provide data on the increase in compiler execution time when this mechanism is used, not just the "retry overhead" metric. For which specific communication patterns or resource constraints does your primary algorithm fail?
-
The average wait time for buffered EPR pairs increases by 6.51x (to ~6.5 ms). Assuming a range of realistic T2 coherence times for near-term hardware (e.g., 10 ms to 100 ms), what is the calculated impact on the final state fidelity of the quantum program? How does this fidelity degradation compare to the fidelity improvements gained from distillation?
-
The performance improvement drops from 8.02x to 4.89x when moving to QEC-encoded circuits. Please provide a detailed analysis explaining why your optimizations are significantly less effective for the communication patterns found in fault-tolerant algorithms like the surface code. Does this not undermine the claim of applicability to FTQC?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents SwitchQNet, a compiler designed to optimize communication in Distributed Quantum Computing (DQC) systems, specifically for the emerging Quantum Data Center (QDC) architecture based on reconfigurable optical switches. The authors identify a critical performance bottleneck in this architecture: the significant latency overheads introduced by slow switch reconfigurations and even slower cross-rack Entangled Bell Pair (EPR) generation.
To address this, the work introduces a novel co-design compiler that operates across both the program and network layers. The core contribution lies in two primary optimizations: 1) the collective generation of in-rack EPR pairs to amortize the high, fixed cost of switch reconfigurations, and 2) the splitting of congested or sequential cross-rack communications into parallelizable cross-rack and in-rack components. The latter cleverly trades a high-latency operation for a set of lower-latency ones that can be executed in parallel, with the incurred fidelity cost being mitigated via entanglement distillation. The work positions itself as a critical software layer for making near-term, optically-switched QDC architectures practical. Through comprehensive simulation, the authors demonstrate an impressive average 8.02x reduction in communication latency over baseline approaches, with a manageable fidelity and resource overhead.
Strengths
The primary strength of this paper is its timely and insightful contribution to a highly plausible, near-term model for scaling quantum computers. It bridges the gap between a promising hardware architecture and the software required to make it performant.
-
Excellent Problem Contextualization: The authors do a superb job of situating their work within the broader landscape. They correctly identify the QDC as a distinct paradigm from long-range quantum repeater networks, noting its unique challenges like memoryless switches and heterogeneous communication latencies (Section 2, p. 2-4). Their motivation, clearly illustrated in Figure 2 (p. 2), highlights that switch reconfiguration and cross-rack links are the dominant sources of latency, justifying the focus of their optimizations.
-
Novel and Well-Adapted Compiler Techniques: The core ideas of "collection" and "splitting" are elegant applications of classical latency-hiding principles (like amortization and parallelization) to the non-classical constraints of a quantum network. The "splitting" technique is particularly clever; it recognizes that in-rack communication is an under-utilized, high-speed resource that can be leveraged to unblock the slower, cross-rack bottleneck. The integration of entanglement distillation to manage the fidelity cost of this split demonstrates a deep, cross-stack understanding of the problem.
-
Holistic, Cross-Layer Approach: The most significant aspect of this work is its co-optimization strategy. Rather than treating the program's communication demands and the network's physical state as separate problems, SwitchQNet's compiler considers them simultaneously. The look-ahead scheduling, which is aware of both the quantum circuit's dependency graph and the available network bandwidth/buffer space, is precisely the kind of holistic approach needed to manage the complex trade-offs in DQC.
-
Robust and Forward-Looking Evaluation: The experimental evaluation is thorough. The authors not only show a significant performance improvement but also demonstrate the robustness of their approach across various benchmarks, network topologies, and hardware parameters (Section 5.4, p. 12). The inclusion of a study on integrating their compiler with Quantum Error Correction (QEC) using surface codes (Section 5.5, p. 13) is a major strength, as it shows the pathway for this work to remain relevant in the fault-tolerant era.
Weaknesses
The weaknesses of the paper are largely related to its focused scope and the assumptions necessary for a simulation-based study. They do not detract significantly from the core contribution.
-
Tight Coupling to a Specific Architecture: The compiler's optimizations are highly tailored to the hierarchical, switched QDC architecture described in [59]. While this focus is what makes the work so effective for that model, it also limits its direct generalizability. It is less clear how these specific techniques would apply to alternative DQC interconnects, such as those with static, non-reconfigurable links or architectures based on fundamentally different physical principles.
-
Abstracted Fidelity Model: The paper handles fidelity overhead commendably by including entanglement distillation and modeling the costs of additional EPR pairs. However, the analysis necessarily relies on specific, assumed values for component fidelities and distillation success rates. The real-world interplay between latency, decoherence during storage, and the noise introduced by distillation could be more complex. A deeper exploration of the sensitivity of the results to these fidelity parameters would strengthen the claims.
-
Scalability of the Compilation Process: The paper demonstrates that the output of the compiler enables scalable quantum computation. However, the computational complexity of the compiler itself is not discussed. The look-ahead scheduling, dependency analysis, and particularly the auto-retry mechanism for resolving deadlocks could potentially become a bottleneck for compiling extremely large and complex quantum programs.
Questions to Address In Rebuttal
-
Generalizability of Principles: The proposed optimizations are tailored to a hierarchical, switched QDC. Could the authors comment on how the core principles of "collection" and "splitting" might be adapted for other DQC interconnect paradigms, such as those with static all-to-all optical connections or perhaps micro-electromechanical systems (MEMS) based switches with different latency/loss profiles?
-
The Fidelity-Latency Trade-off: The paper shows that the latency benefits far outweigh the modest increase in weighted EPR pair count. However, could the authors elaborate on the break-even point? At what level of distillation inefficiency or base in-rack EPR infidelity would the "splitting" optimization become detrimental to overall computational fidelity, effectively making the latency reduction a Pyrrhic victory?
-
Practicality of the Retry Mechanism: The auto-retry mechanism is a crucial component for ensuring deadlock-free execution. The results show it is rarely triggered (Table 2, p. 10). However, in the cases where it is, how does reverting to a more conservative "buffer-assisted on-demand" strategy impact performance? Is the slowdown graceful, or does it represent a significant performance cliff for certain communication patterns?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present SwitchQNet, a compiler designed to optimize communication scheduling for a specific, near-term architecture for Quantum Data Centers (QDCs). This architecture, based on prior work [59], utilizes a hierarchical network of reconfigurable classical and quantum optical switches to connect racks of QPUs. The paper identifies the high latency of cross-rack communication and switch reconfiguration as the primary performance bottlenecks.
To address this, the authors propose a co-optimization strategy across the program and network layers, centered on two primary techniques:
1. Collective in-rack EPR generation: Batching requests for in-rack EPR pairs to minimize the overhead of frequent, slow switch reconfigurations.
2. Parallelized cross-rack EPR generation: A "split" mechanism where a congested cross-rack EPR pair request between QPU A and B is decomposed into a cross-rack pair between a less busy QPU A' (in the same rack as A) and B, and a subsequent in-rack pair between A and A'. These are later joined via entanglement swapping.
The compiler employs a look-ahead scheduling approach, manages buffer resources to avoid deadlock and congestion, and uses entanglement distillation to mitigate the fidelity overhead introduced by the "split" mechanism. The evaluation demonstrates a significant latency reduction of up to 8.02x compared to a baseline on-demand scheduling approach.
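As a hedged illustration of the amortization behind the collective in-rack generation summarized in item 1 above: batching k in-rack requests behind a single reconfiguration pays the reconfiguration cost once rather than k times. The constants below are placeholders; only the ~1 ms reconfiguration figure and the qualitative latency hierarchy come from the paper as characterized in these reviews, and the worst-case "reconfigure per request" baseline is an assumption.

```python
# Hedged amortization sketch for collective in-rack EPR generation.
# Latency constants are assumed placeholders; only the qualitative hierarchy
# (reconfiguration >> in-rack generation) follows the paper's motivation.

T_RECONFIG = 1000  # assumed switch reconfiguration latency (us); reviews quote ~1 ms
T_INRACK   = 50    # assumed in-rack EPR generation latency (us)

def on_demand(k):
    # Worst case: each request triggers its own reconfiguration before generation.
    return k * (T_RECONFIG + T_INRACK)

def collective(k):
    # One look-ahead batch: reconfigure once, then generate k pairs back to back.
    return T_RECONFIG + k * T_INRACK

for k in (1, 4, 16):
    print(f"k={k:2d}: on-demand = {on_demand(k)} us, collective = {collective(k)} us, "
          f"speedup = {on_demand(k) / collective(k):.2f}x")
```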
Strengths
The primary novel contribution of this work is the specific strategy for parallelizing cross-rack communication by offloading the long-latency component to an idle QPU within the same rack (Section 3, page 5). While routing via intermediate nodes is a well-known concept in repeater networks, the application here is distinct. It is not about finding an end-to-end path through a network of repeaters; rather, it is a specific maneuver to alleviate QPU-level congestion within a local data center topology. This demonstrates a clear understanding of the specific bottlenecks in the target architecture [59] and proposes a tailored, non-obvious solution.
Furthermore, the synthesis of this "split" mechanism with the collective generation of in-rack pairs and the associated resource management framework constitutes a cohesive and novel compilation strategy. While the individual components have conceptual precedents, their combination to holistically address the unique trade-offs (reconfiguration latency vs. cross-rack latency vs. fidelity) of this switched QDC architecture is new. The work correctly identifies that a simple on-demand approach is insufficient and that a more sophisticated co-design is necessary. The significant performance gains reported in the evaluation suggest that this novel synthesis is not merely an academic exercise but a potent optimization.
Weaknesses
The core weakness of this paper lies in the framing of its novelty. Several of the foundational ideas presented as key enablers of the proposed system are, in fact, well-established principles from prior art in both quantum and classical computing.
-
Decoupled EPR Generation and Buffering: The idea of pre-generating EPR pairs and storing them in a buffer to hide communication latency is a standard technique in distributed quantum computing. The authors’ own baseline construction is based on [70] (Qucomm), which is a buffer-aware compiler. This is not a novel quantum feature leveraged by this work, but rather a prerequisite for any advanced DQC scheduling.
-
Collective Generation: The concept of batching multiple small operations to amortize a large setup cost is a cornerstone of computer science. The "collective generation of near-future in-rack EPR pairs" (Section 3, page 5) is a direct application of this principle, where the "operations" are in-rack EPR generations and the "setup cost" is the switch reconfiguration latency. While effective, this is an incremental and logical application of a known optimization pattern, not a fundamentally new idea.
-
Entanglement Swapping and Distillation: These are fundamental primitives of quantum communication, not novel techniques introduced by this work. Their use is a necessary implementation detail for the core "split" idea.
The paper would be stronger if it were more precise in its claims. The novelty is not in using buffers, batching, or swapping, but in the specific algorithm that orchestrates them—namely, the decision logic of when to split cross-rack communications and how to collectively schedule the resulting in-rack dependencies. The current presentation risks overstating the conceptual novelty by conflating the application with the invention of these underlying principles.
Questions to Address In Rebuttal
-
The central "cross-rack split" mechanism appears functionally similar to routing protocols in long-range quantum networks (e.g., [54], [60]), where an end-to-end link is established via intermediate nodes and entanglement swapping. Please clarify the fundamental distinction. Is the novelty purely in its application to solve QPU congestion rather than link availability/distance, or is there a more fundamental difference in the algorithm itself?
-
The collective generation of in-rack pairs is presented as a key optimization. Can the authors provide citations to prior work in DQC or classical distributed/HPC systems that use communication batching to overcome network setup/reconfiguration overheads, and then explicitly state the delta between those works and the proposed method?
-
The proposed "split" introduces both additional EPR pairs (requiring distillation) and dependencies on buffer availability, as illustrated in the deadlock scenario (Figure 7b, page 7). This adds significant complexity. Is there a quantitative threshold or heuristic derived from this work where the fidelity cost and scheduling complexity of the split-and-distill approach begin to outweigh the latency benefits? For instance, how does the strategy perform if the base in-rack fidelity is significantly lower, requiring more rounds of distillation?
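To ground this question, here is a toy fidelity model of the split: the standard single-error-type approximations for entanglement swapping and one round of recurrence purification, applied with assumed raw fidelities. It is not the paper's noise model; it only illustrates where a break-even analysis would live.

```python
# Hedged toy model for the fidelity cost of the split (in-rack pair + swap,
# then one purification round). Uses simplified single-error-type formulas
# common in the repeater literature, not the paper's noise model.

def swap(f1, f2):
    # Swapping two one-error-type Bell-diagonal pairs: the result is correct
    # if both inputs were correct or both were flipped.
    return f1 * f2 + (1 - f1) * (1 - f2)

def purify(f1, f2):
    # One round of recurrence purification (success-conditioned fidelity).
    return (f1 * f2) / (f1 * f2 + (1 - f1) * (1 - f2))

f_cross, f_inrack = 0.95, 0.99          # assumed raw fidelities
direct = f_cross                         # un-split cross-rack pair
split = swap(f_cross, f_inrack)          # A-A' in-rack pair swapped with A'-B pair
split_distilled = purify(split, swap(f_cross, f_inrack))  # distill two split pairs

print(f"direct cross-rack pair:     {direct:.4f}")
print(f"split (swap only):          {split:.4f}")
print(f"split + one distill round:  {split_distilled:.4f}")
```

The interesting break-even question is how many extra pairs and distillation rounds are needed, as a function of the raw in-rack fidelity, before the latency gain stops paying for the fidelity cost.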
Assassyn: A Unified Abstraction for Architectural Simulation and Implementation
Abstract
The continuous growth of on-chip transistors driven by technology scaling urges architecture developers to design and implement novel architectures to effectively utilize the excessive on-chip resources. Due to the challenges of programming in register-...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present Assassyn, a high-level programming framework intended to unify architectural simulation and RTL implementation from a single source description. The core abstraction models pipeline stages as functions that are invoked asynchronously, akin to asynchronous event handling in software. The system compiles this high-level description into two targets: a cycle-accurate simulator in Rust and a synthesizable RTL implementation in SystemVerilog. The authors evaluate Assassyn by implementing several hardware designs, from simple components to an out-of-order CPU, and compare them against handcrafted RTL and HLS-generated designs on metrics of productivity (LoC), performance, and area. The central claim is that this unified approach ensures perfect alignment between simulation and implementation while offering high productivity and generating quality hardware.
Strengths
-
Guaranteed Simulator-RTL Alignment: The paper's strongest contribution is the co-generation of a simulator and RTL from a single source. The claim that "all the cycles counts from Verilator simulated RTL exactly match our Assassyn-generated Rust simulator" (Section 7, Q5, page 11) is a significant and valuable property, directly addressing a well-known pain point in hardware development.
-
Coherent Abstraction Model: The core concept of modeling pipelined hardware as a system of asynchronously communicating functions is internally consistent and provides a structured way to think about dataflow and control flow between stages. (A generic, non-Assassyn sketch of this paradigm follows this list.)
-
Demonstrated Breadth of Application: The authors have applied the framework to a non-trivial range of designs, including a systolic array, an in-order CPU, and a rudimentary out-of-order CPU. This demonstrates that the abstraction is not limited to a single architectural pattern.
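To ground the abstraction being discussed, the following deliberately generic toy models pipeline stages as functions that asynchronously invoke one another, with calls issued in cycle N consumed in cycle N+1. This is not Assassyn's API; every name here is invented for illustration, and the sketch only mirrors the paradigm described in this review and in the paper's Section 2.

```python
# Generic toy illustrating "pipeline stages as asynchronously invoked functions".
# This is NOT Assassyn's API; all names are invented. Calls issued in cycle N
# are executed in cycle N+1, mimicking a registered pipeline boundary.
from collections import deque

class EventSim:
    def __init__(self):
        self.now = 0
        self.next_events = deque()

    def async_call(self, stage, *args):
        # Schedule `stage(*args)` for the next cycle (illustrative semantics).
        self.next_events.append((stage, args))

    def run(self, cycles):
        for cycle in range(cycles):
            self.now = cycle
            events, self.next_events = self.next_events, deque()
            for stage, args in events:
                stage(*args)

sim = EventSim()
PROGRAM = [3, 5, 7]   # toy "instruction memory": operands to be doubled

def fetch(pc):
    if pc < len(PROGRAM):
        print(f"cycle {sim.now}: fetch   pc={pc}")
        sim.async_call(execute, PROGRAM[pc])   # hand the operand to the next stage
        sim.async_call(fetch, pc + 1)          # keep the front end running

def execute(operand):
    print(f"cycle {sim.now}: execute {operand} -> {2 * operand}")

sim.async_call(fetch, 0)
sim.run(6)
```

The event trace printed per cycle is what the reviews describe as the "transposed" view of an RTL waveform: one row per cycle listing which stages fired, rather than one row per signal.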
Weaknesses
My primary concerns with this submission relate to the rigor of the evaluation and the strength of the claims, which appear to be overstated based on the evidence provided.
-
Unsupported Claims of "Comparable" Quality to Handcrafted RTL: The abstract and Section 7 (Q3) claim the generated RTL achieves "comparable perf/area compared to handcrafted RTL." The paper's own data contradicts this. In Figure 14 (page 10), the Assassyn-generated 5-stage CPU has an area of 365µm², a 42% overhead compared to the 257µm² Sodor reference. A 42% area increase is by no means "comparable" in a hardware context. The normalized perf/area metric in Figure 12 obscures this critical detail. The performance claim is also weak; a 2.6% IPC improvement (Figure 15a) is marginal and comes at a significant area cost. The claims must be revised to accurately reflect the data.
-
Flawed and Misleading Productivity Metrics: The Lines of Code (LoC) comparison in Section 7 (Q2, page 10) is methodologically unsound. The authors explicitly state, "We excluded all the highly overengineered common modules in Chipyard-related reference designs for Sodor CPU, and Gemmini systolic array, or the LoC comparison will be badly skewed." This is a textbook case of cherry-picking the comparison baseline to support a claim. A rigorous comparison requires including all necessary code or providing a principled justification for exclusions. As it stands, the 70% LoC claim is not credible. Furthermore, LoC is a poor proxy for design effort or complexity.
-
Absence of Stated Limitations and Overstated Generality: The paper presents Assassyn as a "general-purpose" framework but fails to discuss its limitations. What architectural features can it not express? The paper is silent on critical topics such as multi-clock domains, asynchronous resets, complex memory interfaces (e.g., AXI), or arbitrary finite state machines that do not fit the simple pipeline-stage model. The admission that radix_sort and merge_sort were "particularly challenging to manually manage" (Section 8.2, page 13) suggests the abstraction has weaknesses for control-intensive algorithms, but this is not explored as a formal limitation. Without a "Limitations" section, the "general-purpose" claim is unsubstantiated.
-
Superficial Simulator Performance Comparison: The comparison against gem5 (Section 7, Q5, page 11) is an apples-to-oranges comparison. Gem5 is a detailed, full-system simulation framework capable of booting operating systems. Assassyn appears to be a module-level or core-level simulator. The authors' own admission that "for longer-running workloads like qsort and rsort, gem5 achieves an order of magnitude speedup" invalidates the headline claim of achieving "2.2-8.1x simulation speedup" as a general finding. The speedup is only valid relative to Verilator and for short-running workloads.
-
Insufficient Detail on Advanced Implementations: The out-of-order (OoO) CPU case study (Section 7, Q6, page 12) is used to demonstrate scalability to complex designs. However, the implementation is described at a very high level. Critical microarchitectural details are missing: What are the sizes of the reservation station and reorder buffer? What is the issue width? What specific OoO features (e.g., register renaming scheme, memory disambiguation) were implemented? Without these details, it is impossible to assess whether Assassyn truly scales to a realistic OoO core or merely a toy example. The reported 1.26x speedup for a 1.43x area overhead is also a questionable design trade-off.
Questions to Address In Rebuttal
-
Please justify the claim that a 42% area overhead for the in-order CPU (Figure 14) is "comparable" to a handcrafted design. Please provide a quantitative definition of "comparable" that you will adhere to throughout the paper.
-
Please either (a) re-run the LoC comparison against the Sodor and Gemmini baselines without excluding the Chipyard modules you deemed "overengineered," or (b) provide a much more rigorous, file-by-file justification for why specific modules in the reference are not part of the core design and should be excluded.
-
Please enumerate the architectural features and hardware design patterns that Assassyn cannot currently support. Specifically address multi-clock designs, standard bus protocols like AXI, and arbitrary FSMs that do not map cleanly to pipeline stages.
-
To substantiate the claim of scalability to complex designs, please provide the key microarchitectural parameters for your OoO CPU implementation (e.g., issue/retire width, ROB size, reservation station size, load/store queue size).
-
The arbiter generation (Section 4.2) appears to rely on user-provided tags like #priority_arbiter. How does the framework handle more complex arbitration schemes (e.g., weighted round-robin, matrix arbiters)? Is the generation fully automatic for any topology, or does it require manual intervention for non-trivial cases?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents Assassyn, a novel programming framework designed to unify the processes of architectural simulation and hardware implementation. The authors identify a fundamental and persistent challenge in computer architecture: the divergence between high-level performance models (often written in C++/Python) and low-level hardware implementations (written in RTL). This divergence creates significant overhead in maintaining alignment and verifying correctness across separate codebases.
The core contribution of Assassyn is a unified, high-level programming abstraction based on the concept of asynchronous event handling. The key insight is that the behavior of a pipelined hardware architecture can be elegantly modeled as a set of stages that asynchronously invoke one another, passing data forward. This single abstraction is expressive enough to describe complex hardware behavior and, crucially, can be compiled to generate both a cycle-accurate simulator (in Rust) and a synthesizable RTL implementation (in SystemVerilog). The evaluation demonstrates that this approach not only achieves its unification goal but also results in high productivity, high-quality hardware, and a simulation performance that is both fast and perfectly aligned with the generated RTL.
Strengths
-
Elegant and Powerful Core Abstraction: The central idea of using asynchronous event handling (or more simply, asynchronous function calls) as a unifying paradigm for both simulation and synthesis is exceptionally strong. The paper does an excellent job in Section 2 (Page 3) of motivating this by identifying the "push/pull mismatch" between simulation and RTL and highlighting the "transposed" correspondence between an event trace and an RTL waveform (Figure 2, Page 3). This is a deep and insightful observation that forms a solid foundation for the entire framework. It reframes hardware design from a structural netlist-centric view to a behavioral, dataflow-centric one.
-
Tackles a Foundational Problem: The work addresses a genuine and costly problem in the field. The effort spent bridging the gap between architectural models and RTL is a major bottleneck in both academic research and industrial practice. By proposing a fundamental solution rather than an incremental fix, this work has the potential for significant impact on how hardware is designed and verified.
-
Demonstrated Generality: A major risk for any new hardware description paradigm is that it only works for a narrow class of problems. The authors effectively mitigate this concern through a well-chosen set of evaluation targets. By successfully implementing not only a simple linear pipeline (in-order CPU) but also a complex 2D dataflow architecture (systolic array) and a design with significant control-flow complexity (an out-of-order CPU), they make a convincing case for the framework's generality. The bind abstraction for handling multi-source dataflows (Section 3.7, Page 5) is a particularly clever feature that enables this expressiveness.
-
Strong and Well-Rounded Evaluation: The paper is supported by a comprehensive evaluation that validates its primary claims. The results are compelling across multiple axes:
- Productivity: A ~30% reduction in lines of code compared to handcrafted RTL is a significant improvement (Figure 11, Page 9).
- Quality of Results (QoR): The generated RTL is shown to be comparable in performance and area to manual designs and vastly superior to a traditional HLS flow (Figure 12, Page 10). This is a critical result that proves the abstraction does not come at an unacceptable hardware cost.
- Simulator Fidelity and Performance: The fact that the generated simulator is perfectly cycle-aligned with the RTL is a massive advantage over conventional simulators like gem5. Achieving this alignment while also being 2.2-8.1x faster than RTL simulation (Verilator) is the "killer app" of this unified approach (Figure 16, Page 11).
- Agility: The progressive CPU design case study (Section Q6, Page 12) is a masterful demonstration of the framework's power, showing how architects can rapidly iterate on design choices and immediately understand the performance and area trade-offs.
Weaknesses
-
Scalability to Full System-on-Chip (SoC) Complexity: The presented examples, while impressive, are largely self-contained compute kernels or processors. It is less clear how the Assassyn paradigm scales to the integration challenges of a full SoC, which involves heterogeneous IP, complex memory subsystems, and standardized bus protocols (e.g., AXI, CHI). The simple async_call model may need significant extension to handle the transactional, multi-master nature of modern interconnects.
-
Positioning Relative to Modern Hardware Construction Languages: The paper positions itself well against traditional RTL and HLS. However, it could be more deeply contextualized against the landscape of modern hardware generator frameworks like Chisel/FIRRTL and SpinalHDL. While the authors correctly characterize Chisel as a more powerful "wrapper" for RTL (Section 8, Page 12), these frameworks are also evolving to incorporate higher-level abstractions. The paper would be strengthened by a more direct comparison of Assassyn's asynchronous event paradigm against the generator-based and metaprogramming paradigms used in these more established high-level tools.
-
Abstraction of Physical Realities: While the abstraction is a key strength, it may also hide important low-level details from the designer too effectively. The framework automates the generation of FIFOs and state machines, but provides little insight into how a designer would manage physical design challenges like timing closure, clock domain crossing, or layout-dependent wire delays. For a design to be practical, there must be a path for physical implementation feedback to influence the high-level source code. This feedback loop is not discussed.
Questions to Address In Rebuttal
-
Debugging and Verification: The paper claims that the serialized nature of the event-driven simulator simplifies debugging. While this is likely true for functional bugs, how does a designer debug a performance bug or a hardware-specific issue that only manifests in the generated RTL? For instance, if an unexpected stall occurs due to a full FIFO that was automatically generated, what tools or methodologies does Assassyn provide to trace this back to the high-level source and understand the cause?
-
Interoperability with Legacy IP: A key requirement for any practical hardware design flow is the ability to integrate existing, third-party IP, which is almost always delivered as standard Verilog/VHDL. How does an Assassyn-generated module communicate with a standard RTL module? Is there a defined methodology for wrapping legacy IP or for generating standard interface wrappers (e.g., AXI-Stream) on Assassyn modules?
-
The "C-like" Analogy: In the discussion (Section 8, Page 13), the authors make a powerful analogy, framing Assassyn as a "C-like" abstraction for hardware compared to the "assembly" of SystemVerilog. Could the authors elaborate on this? Specifically, what are the hardware design equivalents of software's unstructured
gotothat their paradigm deliberately constrains or eliminates, and what structured programming principles does it enforce in its place? Sharpening this analogy could further clarify the work's conceptual contribution.
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present Assassyn, a programming framework intended to unify architectural simulation and RTL implementation from a single, high-level source code. The core abstraction proposed is based on asynchronous event handling, where pipeline stages are modeled as functions that are invoked asynchronously (async_call), and dataflows between stages are managed through function arguments and a bind primitive for multi-source inputs. The framework includes a compiler that transforms this high-level description into two backends: a cycle-accurate simulator in Rust and a synthesizable RTL implementation in SystemVerilog. The authors claim this is the "first RTL generator that is both high-level, and general-purpose" (Page 2, Section 1).
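For concreteness, the following is a minimal, purely illustrative Python sketch of the programming style described in this summary: pipeline stages as functions that are invoked asynchronously, with `bind` pre-supplying one operand of a multi-source stage. The `async_call` and `bind` names follow the review's description of the paper; the event-queue semantics, syntax, and cycle loop below are invented for illustration and are not the actual Assassyn API.

```python
# Minimal, illustrative sketch of an "async_call + bind" style of describing a
# pipeline (names follow the review's description; this is NOT the real
# Assassyn API or syntax).
from collections import deque
from functools import partial as bind  # 'bind' as function partials, per the review

event_queue = deque()              # stage invocations scheduled for the next cycle

def async_call(stage, *args):      # schedule a stage instead of calling it inline
    event_queue.append((stage, args))

# Three toy pipeline stages, each handing its result to the next asynchronously.
def fetch(pc):
    insn = ("addi", pc)            # pretend we fetched an instruction
    async_call(decode, insn)

def decode(insn):
    op, operand = insn
    # Pre-bind one source of a multi-source stage; the other arrives at call time.
    async_call(bind(execute, op), operand)

def execute(op, operand):
    print(f"execute {op} with operand {hex(operand)}")

# A toy cycle loop: drain exactly the events scheduled for the current cycle.
async_call(fetch, 0x100)
cycle = 0
while event_queue:
    for _ in range(len(event_queue)):
        stage, args = event_queue.popleft()
        stage(*args)
    cycle += 1
print(f"finished after {cycle} cycles")
```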
My assessment is that while the specific implementation and the synthesis of ideas from software engineering is elegant, the fundamental claim of novelty is overstated. The concept of a unified, high-level abstraction for generating both simulation models and RTL is not new. The novelty of Assassyn is therefore limited to its specific choice of abstraction—asynchronous function calls—which itself has strong conceptual parallels to existing paradigms in prior art that are not adequately addressed in this paper.
Strengths
- Elegant Abstraction: The application of software engineering concepts like `async` function calls and `bind` (function partials) to hardware description is a clean and intuitive mapping for pipelined and dataflow architectures. The systolic array example (Page 6, Figure 5) is a particularly strong demonstration of the `bind` primitive's expressiveness for regular structures.
- Demonstrated Generality: The authors successfully demonstrate their framework on a diverse set of designs, from simple components (priority queue), to dataflow accelerators (systolic array), to complex control-flow architectures (in-order and out-of-order CPUs). This lends credibility to the "general-purpose" aspect of their claim, differentiating it from more domain-specific languages for hardware generation.
- Unified Backend Generation: The direct generation of a perfectly aligned simulator and RTL from a single source is a significant engineering achievement. The authors' empirical demonstration of subtle behavioral mismatches in established tools like gem5 (Page 11, Section Q5) provides a compelling motivation for this unified approach.
Weaknesses
My critique is centered on the core novelty claim.
-
The Concept of a Unified High-Level Abstraction is Not Novel: The paper frames the separation of simulation and RTL codebases as the primary problem it solves with a novel unified approach. However, this unification has been a central goal and achievement of high-level hardware construction languages for over two decades.
- Bluespec SystemVerilog (BSV) [1]: BSV provides a high-level abstraction based on Guarded Atomic Actions (Rules). A design is expressed as a set of rules, and the BSV compiler generates both cycle-accurate, executable C++ models and synthesizable Verilog RTL from this single source. The core premise of unification is identical.
- Chisel [6]: Chisel is a Scala-embedded DSL that generates Verilog. The same Chisel code is used to generate the RTL and can be executed to produce simulation behavior (via tools like Treadle or Verilator-backed simulation). While the authors argue Chisel adheres "closely to the RTL's programming and execution model" (Page 4, Section 2.1), it is nonetheless a unified, high-level, and general-purpose framework.
-
Conceptual Overlap with Prior Abstractions is Not Addressed: The paper's primary contribution must lie in its specific abstraction (`async_call` + `wait_until`). However, this model is functionally and conceptually very similar to the Rule-based model in BSV, a comparison the paper entirely omits.
- An Assassyn stage defined with a `wait_until` condition is analogous to a BSV Rule with an implicit guard condition. The Assassyn stage "executes" when its condition is met; the BSV Rule "fires" when its guard condition is met.
- Both abstractions aim to raise the level of thinking away from manual clock-edge management and state machine encoding. The "delta" between Assassyn's model and BSV's seems to be primarily syntactic and stylistic (imperative/functional `async` calls vs. declarative Rules) rather than a fundamental leap in expressive power or a new hardware paradigm. The lack of a direct comparison to BSV is a critical flaw in the paper's argument for novelty.
-
The Novelty Claim is Imprecise: The statement that Assassyn is the "first RTL generator that is both high-level, and general-purpose" (Page 2, Section 1) is demonstrably false, as BSV and Chisel both fit this description. The authors must narrow their claim to the specifics of their programming model and provide a compelling argument for why it represents a significant advancement over these established systems.
Questions to Address In Rebuttal
- Please provide a detailed technical comparison between Assassyn's asynchronous event-handling model (`async_call`, `wait_until`) and the Guarded Atomic Actions (Rules) paradigm in Bluespec SystemVerilog. What specific hardware designs or behaviors can be expressed more easily or efficiently in Assassyn that are difficult or impossible in BSV, and vice versa? Without this comparison, the novelty of the core abstraction is unclear.
- Given the existence of Chisel and Bluespec SystemVerilog as prior art for unified, high-level, general-purpose RTL generation, please refine the central novelty claim of this paper. What precise, novel contribution does Assassyn make that these frameworks do not?
- The proposed abstraction appears well-suited for architectures that can be modeled as a static graph of components pushing data to one another. How does the abstraction handle more dynamic or irregular control and data flows, such as a complex on-chip network router or a shared bus arbiter where components must pull data based on state, rather than having it pushed via a pre-bound connection?
Concorde: Fast and Accurate CPU Performance Modeling with Compositional Analytical-ML Fusion
Abstract
Cycle-level simulators such as gem5 are widely used in microarchitecture design, but they are prohibitively slow for large-scale design space explorations. We present Concorde, a new methodology for learning fast and accurate performance models of ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present Concorde, a performance modeling methodology for CPUs that aims to supplant slow, cycle-level simulators. The approach is a two-stage process: first, a set of simplified analytical models are used to generate compact performance distributions for individual microarchitectural components, assuming each operates in isolation. These distributions, derived from an instruction trace, serve as features. Second, a shallow Multi-Layer Perceptron (MLP) is trained to take these distributions and a vector of microarchitectural parameters as input to predict the final Cycles-Per-Instruction (CPI). The authors claim this "analytical-ML fusion" is over five orders of magnitude faster than their reference simulator at inference time, with an average prediction error of approximately 2%. They demonstrate a use case for this speed by performing a large-scale performance attribution analysis using Shapley values.
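As a schematic illustration of the two-stage structure described in this summary (not the authors' actual feature definitions, component models, or network architecture), the following Python sketch uses stand-in per-component analytical models to produce throughput-bound timeseries, compacts them into fixed-size distribution features, and feeds them, together with a microarchitectural parameter vector, to a small untrained MLP.

```python
# Schematic sketch of the "analytical features + parameters -> small MLP -> CPI"
# structure described above. The component list, feature choices, and the
# untrained random-weight MLP are placeholders, not Concorde's actual models.
import numpy as np

rng = np.random.default_rng(0)

def per_component_bounds(trace_len, n_components=4):
    """Stand-in for the per-component analytical models: each yields a
    throughput-bound timeseries, computed as if every other component had
    infinite capacity (random data here instead of a real trace analysis)."""
    return [rng.gamma(2.0, 1.0, size=trace_len) for _ in range(n_components)]

def compact_features(timeseries, quantiles=(0.1, 0.5, 0.9)):
    """Compress each component's timeseries into a few distribution points,
    giving a feature vector whose size does not depend on trace length."""
    return np.concatenate([np.quantile(ts, quantiles) for ts in timeseries])

def mlp_predict(features, uarch_params):
    """Tiny two-layer MLP with random weights, standing in for the trained model."""
    x = np.concatenate([features, uarch_params])
    w1 = rng.normal(size=(x.size, 16)); b1 = np.zeros(16)
    w2 = rng.normal(size=(16, 1));      b2 = np.zeros(1)
    h = np.maximum(x @ w1 + b1, 0.0)    # ReLU
    return float((h @ w2 + b2)[0])      # "predicted" CPI (meaningless when untrained)

bounds   = per_component_bounds(trace_len=10_000)
features = compact_features(bounds)
uarch    = np.array([192.0, 64.0, 4.0])  # e.g. ROB entries, LQ entries, issue width
print("feature vector length:", features.size)
print("toy CPI prediction:", mlp_predict(features, uarch))
```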
Strengths
- The conceptual decomposition of the performance modeling problem into isolated analytical estimations followed by a unifying ML model is an interesting research direction.
- The methodology's architecture, which separates a one-time, per-program precomputation cost from a near-instantaneous, per-configuration inference cost, is well-suited for large-scale design space exploration.
- The application of Shapley values for performance attribution is a novel contribution to the field of computer architecture, moving beyond simple parameter ablation studies.
- The scale of the evaluation, involving millions of CPI evaluations across a wide range of benchmarks and a vast parameter space, is substantial.
Weaknesses
My primary concerns with this work center on the fundamental soundness of the analytical models, the unverifiable nature of the "ground truth," and the potential for the ML model to mask, rather than fuse, underlying physical realities.
-
Non-Reproducible and Unverifiable Ground Truth: The entire methodology is validated against a "proprietary gem5-based cycle-level trace-driven CPU simulator" (Section 3, page 3). This is a critical flaw. Without access to the simulator's source code and a detailed list of modifications from the public gem5 version, the paper's results are entirely non-reproducible. The validity of the claimed 2% error is contingent on a ground truth that the research community cannot inspect or verify.
-
Fundamentally Unsound Analytical Assumptions: The core of the feature engineering relies on analyzing each microarchitectural component "in isolation, assuming other CPU components have infinite capacity" (Section 2, page 3). This assumption discards the single most complex aspect of modern OoO processor performance: the interaction and back-pressure between components. For example, a full ROB stalls the frontend, and a congested memory system stalls the backend issue queues. The authors claim the MLP learns these "higher-order effects," but it is more likely learning to correct for the massive inaccuracies introduced by this flawed initial assumption.
-
The ML Model as a "Black Box" Corrector, Not a Fuser: The authors explicitly state, "the ML model serves as a powerful backstop that can mask significant errors in the analytical model" (Section 3.2, page 6). This is a telling admission. It suggests that the system is not a principled "fusion" but rather a collection of inaccurate heuristics whose errors are patched over by a black-box curve-fitting model. This severely undermines the claim that the model has learned the underlying physics of processor performance. Consequently, the Shapley value analysis (Section 6, page 11) may not be attributing performance to physical components, but rather quantifying the feature importance for the MLP's error-correction task, which is a fundamentally different and less meaningful conclusion.
-
Insufficient Memory Model: The trace analysis uses a "simple in-order cache simulation" augmented by a state machine (Algorithm 1, page 5) to estimate memory latencies. This approach is grossly inadequate for modeling a modern memory subsystem. It ignores MSHR occupancy, interconnect contention, memory controller scheduling policies, and the complex interactions between demand requests and hardware prefetchers. The authors acknowledge this results in a "Discrepancy in raw execution times" (Section 5.2.1, page 10), which they show is a major contributor to the model's error tail. The model's foundation is built on an inaccurate depiction of the most common performance bottleneck.
-
Misleading Claims of Speed: The headline claim of being "five orders of magnitude faster" pertains only to inference time and strategically ignores the substantial precomputation cost. As stated in Section 5.2.3 (page 10), precomputing features for a single 1M instruction region across the parameter space takes 3959 seconds on a single core. This is over an hour of computation for a tiny slice of a program. While this is a one-time cost per program region, it is not negligible and must be factored into any honest assessment of the method's overall efficiency, especially when considering new, unseen workloads.
-
Weak Out-of-Distribution (OOD) Generalization: The OOD experiment in Figure 14 (page 11) reveals the model's brittleness. An average OOD error of 26% for a memory microbenchmark (O3) and ~15% for real workloads like S1 and C2 is unacceptably high for a tool intended for architectural design. This suggests the model relies heavily on seeing similar program behaviors during training and does not generalize well to novel execution patterns.
Questions to Address In Rebuttal
- Can the authors provide the source code for their proprietary simulator or, at a minimum, a detailed document specifying every deviation from the public gem5 O3 CPU model and Ruby memory system? Without this, the paper's claims are unreproducible.
- The analytical models operate "in isolation." Can you provide evidence that the MLP is learning genuine physical interactions (e.g., ROB back-pressure) rather than simply learning a complex, non-linear function to correct for the inherent inaccuracies of the isolated input features?
- Given your admission that the MLP can "mask significant errors" from the analytical models, how can you justify that the Shapley value attribution reflects the importance of physical microarchitectural components, rather than the importance of specific input features to the MLP's internal error-correction mechanism?
- The memory model in Section 3.1 is a significant simplification. How would your model's accuracy, particularly in the error tail, be affected by workloads that heavily stress memory-level parallelism and induce significant memory controller or interconnect contention, which your current model ignores?
- Please clarify the "true" end-to-end cost of your methodology. For a new, full SPEC benchmark run (trillions of instructions), what is the total time required for tracing, feature precomputation across all regions, and training before any inference can be performed?
- The average error metric of ~2% obscures a tail of significant errors. What are the P95 and P99 relative CPI errors on your primary test set? In hardware design, worst-case errors are often more critical than the average case.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents Concorde, a novel methodology for fast and accurate CPU performance modeling. The central problem it addresses is the long-standing trade-off between the speed and fidelity of microarchitectural simulators. Cycle-level simulators like gem5 are accurate but prohibitively slow for large-scale design space exploration, while existing analytical or ML-based approaches often compromise on accuracy or scalability.
Concorde's core contribution is a hybrid "compositional analytical-ML fusion" approach. Instead of learning from raw instruction sequences, it first uses a suite of simple, per-component analytical models to generate compact performance distributions (e.g., throughput bounds for the ROB, load queue, etc.). These distributions serve as rich, domain-aware features for a lightweight ML model (an MLP), which then predicts the final CPI by learning the complex interactions and higher-order effects between these components. This division of labor allows Concorde to achieve a remarkable speedup (over 5 orders of magnitude) compared to a cycle-level simulator while maintaining high accuracy (~2% average CPI error). The authors compellingly demonstrate the power of this speedup by enabling a first-of-its-kind, large-scale performance attribution analysis using Shapley values—a task that would be computationally infeasible with traditional simulators.
Strengths
-
Elegant Core Idea and a New Paradigm: The most significant strength of this work is its core conceptual contribution. The decomposition of the performance modeling problem is both elegant and powerful. It masterfully blends domain-specific knowledge (the analytical models) with the power of machine learning to handle complex interactions. This approach sidesteps the major pitfalls of its predecessors: pure analytical models fail on interaction effects, and pure ML models are data-hungry and struggle to learn first-principles from scratch (an O(L) problem they cleverly reduce to O(1)). This paper doesn't just present a tool; it proposes a new and highly promising paradigm for building performance models for complex systems.
-
Exceptional Performance and Scalability: The results are, frankly, spectacular. Achieving a >100,000x speedup while keeping prediction error in the low single digits is a massive leap forward. The constant-time O(1) inference cost with respect to program length is a fundamental advantage over sequence-based models like TAO, and the authors demonstrate this superiority clearly (Figure 8, p. 9). This level of performance is not merely an incremental improvement; it is transformative.
-
A "Killer Application" that Demonstrates True Impact: The paper's use of Shapley values for performance attribution (Section 6, p. 11-12) is a brilliant choice. It elevates the work from a simple "let's make simulation faster" paper to one that showcases why that speed matters. By enabling a rigorous, fine-grained analysis of performance bottlenecks that is currently impractical, the authors provide a compelling vision for the future of computer architecture research and design. This connects the work to the broader fields of cooperative game theory and explainable AI, demonstrating a mature understanding of its potential impact.
-
Strong Connection to the Literature and Context: The authors do an excellent job positioning their work. They clearly understand the landscape, from cycle-level simulators (gem5) and analytical models (roofline analysis) to recent sequence-based ML models (TAO). The comparison against TAO (Section 5.1, p. 9) is direct and effective, substantiating their claims about the advantages of their compositional approach.
Weaknesses
While the work is outstanding, there are areas where its context and limitations could be further explored. These are not so much flaws as they are important frontiers for this line of research.
-
Implicit Knowledge Engineering Cost: The paper describes the analytical models as "simple," but their creation is an act of expert-level knowledge engineering. The process of identifying the key components, formulating their behavior in isolation, and implementing the models requires deep microarchitectural expertise. This implicit cost could be a significant barrier to applying the Concorde methodology to a fundamentally different architecture (e.g., a VLIW processor or a dataflow accelerator). The paper would benefit from a discussion of the generalizability of the method for creating analytical models, not just the resulting performance model.
-
Dependence on a "Ground Truth Oracle": The Concorde framework, for all its speed, does not eliminate the need for slow, cycle-accurate simulators. It is a "model of a model," trained on data generated by a reference simulator. This is a crucial point: its accuracy is fundamentally capped by the accuracy of its training source, and it requires a significant upfront investment in simulation runs to generate the training dataset (as noted in Section 5.2.4, p. 11).
-
Limited Scope (Single-Core, CPI): The current work is scoped to single-core CPI prediction. Modern performance is heavily influenced by multi-core interactions, memory consistency, interconnects, and other system-level effects. While the authors acknowledge this as future work (Section 8, p. 13), it is the single largest question mark regarding the methodology's broader applicability. The interactions in a multi-core environment are precisely the kind of higher-order effects that are difficult to capture in simple analytical models, potentially making the ML fusion task significantly harder.
Questions to Address In Rebuttal
-
Regarding the creation of analytical models: Can the authors elaborate on the effort and expertise required to port this methodology to a new microarchitecture? For instance, how much of the analytical modeling framework for this OoO core could be reused for a radically different design, and what would be the most challenging new models to develop?
-
Regarding the generalization across the design space: The model is trained on randomly sampled microarchitectures. This is excellent for demonstrating generalizability. However, in practice, DSE often involves searching for non-obvious "corner" designs. How confident are the authors that the model, trained on a uniform random distribution, can accurately predict performance for clustered or sparse regions of the design space that might represent novel architectural ideas?
-
Regarding the path to multi-core systems: What do the authors foresee as the biggest conceptual challenge in extending Concorde to model multi-core systems? Would the primary difficulty lie in creating analytical models for phenomena like cache coherence and interconnect contention, or in the ML model's ability to fuse an exponentially larger set of interacting feature distributions?
-
Regarding the Shapley value analysis: The stability and convergence of Shapley value estimation can be sensitive to the underlying function being analyzed. Could the authors comment on how the inherent statistical error of the Concorde model might impact the fidelity of the resulting performance attributions? For example, could small CPI prediction errors lead to incorrect rankings of bottleneck components?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents Concorde, a performance modeling methodology for CPUs that fuses simple, per-component analytical models with a lightweight machine learning (ML) model. The central claim of novelty lies in its "compositional" approach. Instead of feeding raw instruction sequences to a complex ML model (like prior sequence-based work) or relying solely on oversimplified analytical models, Concorde introduces a novel intermediate representation. For a given program trace, it runs a suite of isolated analytical models, each estimating the performance bounds imposed by a single microarchitectural component (e.g., ROB, Load Queue). This process generates a set of performance-bound timeseries, which are then compacted into fixed-size probability distributions (CDFs). These distributions, which characterize the program's behavior with respect to each component, serve as rich features for a simple MLP that predicts the final CPI. The paper also proposes a novel application of this fast model: using Shapley values for fine-grained, order-independent performance attribution to microarchitectural components.
Strengths
The paper's primary strength is the novelty of its core methodology. While both analytical and ML-based performance modeling are established fields, the specific method of fusion proposed here is, to my knowledge, new.
-
Novel Intermediate Representation: The most significant novel contribution is the use of distributions of performance bounds as the feature set connecting the analytical and ML stages. Prior art has typically used either highly aggregated program statistics (e.g., instruction mix, overall cache miss rates) as inputs to ML models or has attempted to learn directly from instruction embeddings. Concorde's approach (Section 2, page 3, Figure 1) of generating per-component throughput distributions is a genuinely new idea. It elegantly factors the problem: the analytical models capture well-understood, first-order effects in isolation, while the ML model's task is reduced to learning the complex, higher-order interactions between these components. This decomposition is a conceptual advance.
-
Breaking O(L) Complexity: This new representation provides a fixed-size "fingerprint" of a program region, irrespective of its length (L). This successfully breaks the O(L) inference complexity that fundamentally limits sequence-based models like TAO [71], representing a significant architectural shift in ML-for-systems design.
-
Novel Application of Game Theory to Microarchitecture: The use of Shapley values [78] for performance attribution (Section 6, page 11) is a novel and compelling application. Standard parameter ablation studies are notoriously order-dependent, a weakness the authors clearly illustrate (Figure 15). Applying a formal, theoretically-grounded method from cooperative game theory to attribute performance degradation fairly across interacting microarchitectural components is a new and valuable contribution to the field of performance analysis.
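To make the mechanics of this attribution concrete, the following is a minimal Monte-Carlo Shapley sketch over a toy CPI function with an interaction term; the parameter names, values, and cost function are invented and do not correspond to Concorde's learned model.

```python
# Minimal Monte-Carlo Shapley attribution over microarchitectural "players",
# using a toy CPI function with an interaction term in place of the learned model.
import random

PARAMS_GOOD = {"rob": 320, "l1d_kb": 64, "width": 6}   # hypothetical strong design
PARAMS_BAD  = {"rob": 128, "l1d_kb": 16, "width": 2}   # hypothetical weak design

def cpi(p):
    # Invented cost function; note the rob * l1d interaction term.
    return 0.5 + 40.0 / p["rob"] + 8.0 / p["l1d_kb"] + 1.2 / p["width"] \
           + 200.0 / (p["rob"] * p["l1d_kb"])

def shapley(n_samples=20_000, seed=0):
    rnd = random.Random(seed)
    players = list(PARAMS_GOOD)
    phi = {k: 0.0 for k in players}
    for _ in range(n_samples):
        order = players[:]
        rnd.shuffle(order)
        cfg, prev = dict(PARAMS_GOOD), cpi(PARAMS_GOOD)
        for k in order:                          # degrade one parameter at a time
            cfg[k] = PARAMS_BAD[k]
            cur = cpi(cfg)
            phi[k] += (cur - prev) / n_samples   # marginal CPI increase charged to k
            prev = cur
    return phi

attribution = shapley()
print({k: round(v, 4) for k, v in attribution.items()})
print("sum of attributions:", round(sum(attribution.values()), 4),
      "vs. total CPI gap:", round(cpi(PARAMS_BAD) - cpi(PARAMS_GOOD), 4))
```

Averaging marginal contributions over random orderings removes the order dependence of a conventional ablation, and the attributions sum to the total CPI gap by construction.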
Weaknesses
While the compositional framework is novel, the individual constituent parts are not. The paper's claims of novelty rest entirely on the synthesis and the interface between existing concepts.
-
Constituent Models Lack Novelty: The individual analytical models described (Section 3.2, pages 5-6) are largely implementations of well-established principles. The ROB model is a straightforward instruction-level dynamical system, static resource models are simple ratios, and others are basic discrete-event simulations. These are functionally similar to concepts in first-order processor models (e.g., Karkhanis & Smith [48]) or roofline analysis [18]. The novelty is not in these models themselves, but in their deployment as parallel feature extractors for a subsequent ML stage. The paper should be clearer that the innovation is in the framework, not the underlying component models.
-
Conceptual Overlap with Hybrid Modeling: At a high level, the concept of fusing a domain-specific (analytical) model with a general-purpose data-driven (ML) model is not new and exists in other scientific fields (e.g., Physics-Informed Neural Networks). The paper's contribution is the specific, and very effective, instantiation of this paradigm for CPU performance modeling. The authors could strengthen their case by more explicitly positioning their work within this broader context and highlighting why their specific interface (the performance distributions) is the key enabler for this domain.
Questions to Address In Rebuttal
-
The core of the proposed method is the richness of the distributional features. Could a significantly simpler set of statistical features—for example, only the mean, variance, and 99th percentile of each component's throughput timeseries—achieve comparable accuracy? If so, this would challenge the claimed necessity and novelty of using the full CDF, suggesting the proposed feature set is overly complex. Please provide an ablation study on the feature representation itself.
-
The division of labor between the analytical and ML models seems somewhat ad-hoc. For instance, the effects of resource bottlenecks like ROB size are modeled analytically, while the impact of branch mispredictions is captured via auxiliary features for the ML model to learn (Section 3.2.2, page 7). What is the guiding principle for this separation? A more formal justification for what is modeled analytically vs. what is left to ML would strengthen the claim that this is a generalizable methodology rather than a collection of curated heuristics.
-
The Shapley value analysis is a powerful new tool for attribution, but its fidelity is fundamentally bound by the accuracy of the Concorde model. How might systematic biases or error modes in the Concorde model (e.g., under-prediction for memory-bound phases) lead to incorrect or misleading performance attributions from the Shapley value analysis?
GCStack+GCScaler: Fast and Accurate GPU Performance Analyses Using Fine-Grained Stall Cycle Accounting and Interval Analysis
Abstract
To design next-generation Graphics Processing Units (GPUs), GPU architects rely on GPU performance analyses to identify key GPU performance bottlenecks and explore GPU design spaces. Unfortunately, the existing GPU performance analysis mechanisms make it ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose a two-part mechanism, GCStack+GCScaler, for GPU performance analysis and design space exploration (DSE). The first component, GCStack, purports to offer a more accurate breakdown of GPU stall cycles than existing methods by employing what it terms "fine-grained" and "lazy" stall classification. The second component, GCScaler, then uses this baseline analysis from a single detailed simulation, in conjunction with an existing interval analysis model, to scale the stall cycles and estimate performance for alternative GPU configurations without re-simulation. The work aims to blend the accuracy of detailed simulation with the speed of analytical models for rapid DSE.
Strengths
- The paper correctly identifies and motivates a critical problem in the field: the limitations of existing coarse-grained, priority-based cycle accounting mechanisms (e.g., GSI) that can produce misleading performance insights, particularly by over-attributing stalls to memory.
- The proposed approach of separating the problem into (1) generating an accurate baseline and (2) accurately scaling that baseline is methodologically sound in principle.
- The evaluation is extensive in scope, covering 47 workloads and a 13-design DSE space, and comparing against multiple state-of-the-art simulation and analytical modeling techniques.
Weaknesses
My primary concerns with this paper lie in the justification of its core methodological claims and the interpretation of its evaluation results. The claims of accuracy and novelty appear to be built on questionable assumptions.
-
The "Fine-Grained" Stall Classification is an Averaging Scheme, not a Causal Analysis. The core of GCStack's novelty is its "fine-grained" classification (Section 4.2). However, Algorithm 1 reveals this is merely a process of evenly distributing stall cycles across all warp slots within a scheduler if no instruction is issued. This is an accounting trick, not a causal analysis. It assumes all stalled warps contribute equally to the scheduler's inability to issue, which is patently false. A single warp stalled on a long-latency DRAM access is not equivalent to multiple warps stalled on short-latency data dependencies. This averaging can obscure the true critical path just as much as a priority-based scheme, simply in a different way. The claim that this is inherently more "accurate" is unsubstantiated.
-
The Analytical Scaling Model for Idle/Sync Stalls is Overly Simplistic. In Section 4.3, the authors propose scaling `EmptyWS` and `Sync` stalls using the formula `scaleValue = (MemStall_new + ComStall_new) / (MemStall_base + ComStall_base)`. This assumes that idle time scales linearly with the simple sum of memory and compute stall times. This is a profound simplification that ignores the complex, non-linear interplay of system components. For example, alleviating a memory bottleneck can expose a previously hidden compute bottleneck or change warp scheduling dynamics entirely, altering the idle/sync profile in ways this formula cannot capture. There is no theoretical or empirical evidence provided to justify this model's validity across diverse architectural changes.
-
The Representative Warp Calibration Assumes an Invalid Invariant. The technique to adjust the representative warp using a "calibration offset" (Section 4.3, Figure 9) is flawed. It calculates an offset based on the baseline configuration and applies it to different design points. This implicitly assumes that the nature of inter-warp contention and its effect on the "representative" warp remains constant across the design space. Changing the SM count, L1 cache size, or memory system fundamentally alters these dynamics. An offset derived from one specific hardware configuration is unlikely to be valid for a significantly different one. This assumption requires rigorous validation, which is absent.
-
The Reported Average Error Obscures Significant Prediction Failures. The headline result of a 6.37% Mean Absolute Percentage Error (MAPE) is misleading. The per-workload chart in Figure 13 (page 10) clearly shows that for several workloads (e.g., LUD, SRAD1, SRAD2, 2D), the MAPE is far higher, approaching or exceeding 20%. An analysis framework that fails so dramatically for a non-trivial subset of applications cannot be considered robust. The authors must dissect and explain these failures, not hide them behind a low average.
-
The Real Hardware Validation is Inconclusive and Contradictory. In the validation against NCU profiles (Section 5.2, Figure 12), the authors dismiss discrepancies by vaguely citing "inconsistencies of the stall metrics." This is insufficient. For the SRAD2 benchmark, GSI's characterization of ComStruct and Sync stalls appears qualitatively closer to the NCU data than GCStack's. This directly contradicts the central narrative that GCStack is superior. The authors must provide a granular mapping of their metrics to NCU events and rigorously explain every discrepancy, especially those that weaken their claims.
Questions to Address In Rebuttal
The authors must address the following points directly and with specific evidence from their experiments:
-
Justify the "fine-grained" claim in GCStack. How does simply averaging stall cycles across all stalled warps (as shown in Algorithm 1, Line 21) provide more accurate causal insight into performance limiters than a priority-based scheme that attempts to identify the primary limiter? Provide a concrete example where GCStack's averaging correctly identifies a bottleneck that GSI's priority scheme misses.
-
The analytical scaling model for Idle and Sync stalls (Section 4.3) is a first-order approximation. Please provide evidence or a theoretical justification for why this simplification holds across significant architectural changes, such as halving the DRAM AMAT or doubling the number of SMs, where second-order effects are known to be significant.
-
The representative warp calibration (Figure 9) assumes the calculated offset is invariant across the design space. How is this assumption validated? Show data that demonstrates the stability of this offset as, for example, the L1D$ size is varied from 1x to 4x.
-
While the average MAPE is low, Figure 13 shows errors >20% for several workloads (LUD, SRAD1, etc.). Please explain the architectural or workload-specific reasons for these significant prediction failures. What specific interaction is GCScaler failing to model in these cases?
-
In the NCU validation (Figure 12), for SRAD2, GCStack appears to mischaracterize the Sync/ComStruct balance compared to NCU, while GSI appears closer. Please elaborate on the specific "inconsistencies of stall metrics" and reconcile why your proposed method seems less accurate than the prior art in this documented case.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents a two-pronged approach, GCStack and GCScaler, to address a long-standing challenge in computer architecture: achieving both speed and accuracy in GPU performance analysis and design space exploration (DSE). The authors identify two key shortcomings in the state of the art: (1) existing cycle accounting mechanisms (like GSI) use coarse-grained, priority-based attribution that often misidentifies performance bottlenecks by over-emphasizing memory stalls and ignoring concurrent issues; and (2) existing fast DSE methods (like the interval analysis in GCoM) build upon these flawed characterizations, leading to inaccurate predictions.
The authors' core contribution is a synergistic pipeline to resolve this.
1. GCStack provides a highly accurate, one-time characterization of a baseline GPU design. Its novelty lies in its "fine-grained" stall accounting, which distributes a stalled cycle's cost across all concurrently stalled warps, and its "lazy" classification of structural stalls, which defers attribution to the execute stage for greater accuracy.
2. GCScaler then leverages this high-fidelity baseline. It uses established interval analysis techniques to rapidly generate scaling factors for different stall categories, but applies them to the accurate and detailed stall profile from GCStack. It further enhances accuracy by introducing its own analytical models for idle and synchronization stalls, which are often poorly handled by representative warp models, and by calibrating the representative warp selection process using the detailed simulation results.
In essence, the work proposes spending a significant simulation budget once to create an accurate ground truth, and then using fast, analytical methods to intelligently scale that truth across a large design space. This hybrid methodology aims to provide the accuracy of detailed simulation with a speed approaching that of pure analytical modeling.
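The following toy sketch illustrates the fine-grained accounting scheme described in this summary, with invented per-cycle scheduler snapshots and a handful of stall categories; the real GCStack operates per warp scheduler inside a cycle-level simulator and classifies many more stall types.

```python
# Toy per-cycle accounting in the spirit of the scheme summarized above: if a
# scheduler issues nothing in a cycle, that cycle is split evenly across the
# currently stalled warp slots instead of being charged to one "top" reason.
# The snapshots and stall categories below are invented for illustration.
from collections import defaultdict

cycles = [
    {"issued": False, "stalls": ["MemData", "CompData", "Sync"]},
    {"issued": False, "stalls": ["MemData", "MemData"]},
    {"issued": True,  "stalls": []},                # something issued: a base cycle
    {"issued": False, "stalls": ["MemStruct"]},
]

stack = defaultdict(float)
for snap in cycles:
    if snap["issued"]:
        stack["Base"] += 1.0
    elif snap["stalls"]:
        share = 1.0 / len(snap["stalls"])           # even split across stalled warps
        for reason in snap["stalls"]:
            stack[reason] += share
    else:
        stack["EmptyWS"] += 1.0                     # no resident warps at all

print({k: round(v, 2) for k, v in stack.items()})
# Each simulated cycle contributes exactly 1.0 to the stack, so the categories
# still sum to the total cycle count, but concurrent stalls remain visible.
```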
Strengths
-
Elegant Synthesis of Existing Fields: The primary strength of this work lies not in inventing a single new algorithm from scratch, but in its insightful synthesis of two distinct research areas: detailed microarchitectural cycle accounting and fast, analytical DSE. The authors correctly identify that a major source of error in analytical models is the "garbage in, garbage out" problem stemming from an inaccurate baseline characterization. By creating a superior baseline with GCStack, they elevate the potential of the entire interval analysis-based DSE workflow in GCScaler. This is a powerful and well-conceived central idea.
-
Addresses a Fundamental Problem: The speed-versus-accuracy trade-off is one of the most fundamental challenges in computer architecture research and industry practice. This paper tackles the problem head-on, proposing a solution that finds a highly effective sweet spot. The reported 32.7x speedup over simulation for a 1,000-design DSE with only ~6.4% error is a compelling result that would be of immediate practical value to GPU design teams.
-
A More Nuanced View of Performance Bottlenecks: GCStack's fine-grained accounting moves the field beyond the simplistic notion of a single "root cause" stall for any given cycle. By acknowledging and quantifying concurrently occurring stalls (as shown in Figure 6a, page 6), it provides architects with a more holistic and actionable view of performance limiters. This is a conceptual advance that could lead to more effective architectural optimizations, as architects can see the potential benefit of fixing not just the "primary" bottleneck, but secondary and tertiary ones as well.
-
Comprehensive and Rigorous Evaluation: The evaluation is a significant strength. The use of 47 diverse workloads, comparisons against multiple state-of-the-art academic tools (GSI, GCoM, HyFiSS), and validation against real hardware profiling (NCU) provides strong evidence for the authors' claims. The incremental analysis in Figure 13 (page 10), which shows the performance improvement at each stage of GCScaler's design, is particularly effective at demonstrating the value of each of their contributions.
Weaknesses
-
Inherited Limitations of Interval Analysis: While GCScaler cleverly improves upon the interval analysis framework, it is still fundamentally bound by its core assumption: that the behavior of a complex, multi-thousand-warp execution can be reasonably abstracted by a single "representative warp." The authors' calibration of the warp selection (Section 4.3, Figure 9, page 7) is a smart mitigation, but this abstraction remains the most fragile part of the methodology. The paper would be strengthened by a discussion of workload types for which this assumption is likely to break down (e.g., those with extreme phase behavior or highly divergent warp execution paths) and the potential impact on GCScaler's accuracy.
-
Positioning of Novelty: The paper's strength is its synthesis. However, it could do a better job of framing it this way. An outside reader might see the work as two separate, incremental improvements: a better cycle accounting method and a better scaling model. The authors should more explicitly emphasize that the primary novelty is the pipeline and the insight that a high-fidelity baseline fundamentally changes the calculus of what is possible with analytical scaling models.
-
Feasibility of Hardware Implementation: The discussion of a potential hardware implementation (Section 6.2, page 12) is a welcome addition that connects the research to practice. However, it feels somewhat speculative. Implementing the proposed counters, particularly for lazy structural stall classification which requires state tracking between pipeline stages, could introduce non-trivial complexity and timing challenges in a real high-frequency GPU design. This section might be better framed as a conceptual proposal for future work rather than a direct implementation roadmap.
Questions to Address In Rebuttal
-
The calibrated representative warp selection is a clever improvement. However, how does this approach fare on workloads with high phase behavior or multiple, distinct clusters of warp behavior where a single representative warp might be insufficient, even with calibration? Does the accuracy of GCScaler degrade gracefully in such scenarios?
-
GCStack's fine-grained accounting seems highly effective. Could the authors comment on its generality? How tightly is it coupled to the specific pipeline structure of the simulated GPU (e.g., the operand collector stage discussed in Section 4.2, page 6)? How would the methodology adapt to fundamentally different GPU architectures, such as those with different scheduling mechanisms (e.g., two-level schedulers) or pipeline organizations?
-
The core contribution appears to be the synergistic effect of using a better baseline (GCStack) for a scaling model (GCScaler). To isolate this benefit, could the authors provide data on the accuracy of a "hybrid" system where GCScaler's scaling and calibration logic is applied to a baseline generated by the prior state-of-the-art, GSI? This would directly quantify the error reduction that comes specifically from using GCStack's superior baseline and would powerfully reinforce the paper's central thesis.
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents GCStack and GCScaler, a two-part mechanism for GPU performance analysis and design space exploration (DSE). The authors claim that existing methods suffer from coarse-grained, priority-based stall accounting, which misidentifies bottlenecks.
The first proposed mechanism, GCStack, introduces a novel cycle accounting methodology. Its claimed innovations are: 1) fine-grained stall classification, which attributes a single stall cycle fractionally to all concurrently stalled warps rather than a single prioritized one, and 2) lazy structural stall classification, which defers the distinction between memory- and core-related structural stalls from the issue stage to the execute stage to improve accuracy.
The second mechanism, GCScaler, is a DSE framework that leverages the fine-grained CPI stack produced by GCStack. It scales this stack using scale factors derived from an existing interval analysis model (like GCoM [39]), but introduces what it claims are novel analytical models for idle and synchronization stalls and a refined method for selecting the representative warp.
The central thesis is that the combination of a more accurate, fine-grained baseline characterization (GCStack) with targeted analytical scaling (GCScaler) yields a DSE methodology that is significantly more accurate than prior analytical models and significantly faster than repeated cycle-level simulation.
Strengths
The primary strength of this work lies in the conceptual shift proposed by GCStack.
-
Fine-Grained Concurrent Stall Accounting: The core novel idea is moving away from the paradigm of attributing a stall cycle to a single, prioritized event (as seen in GSI [2] and HyFiSS [60]). The proposed method of distributing the cycle across all concurrently stalled warps (Algorithm 1, Line 21, Page 6) is a clear and meaningful departure. This directly addresses a known simplification in prior art and, as the data suggests, corrects for the systemic over-attribution to memory stalls. This is a genuine contribution to the field of GPU performance modeling.
-
Lazy Structural Stall Classification: The technique of deferring structural stall classification until the execute stage (Algorithm 1, Line 25, Page 6) is a specific and clever solution to a concrete microarchitectural problem—the ambiguity introduced by the operand collector. While not a paradigm shift, it is a novel micro-architectural accounting technique that demonstrably improves classification accuracy (Figure 7b, Page 6).
-
Targeted Refinements in GCScaler: While GCScaler is built upon the well-established foundation of interval analysis (GPUMech [29], GCoM [39]), it does not simply reuse it. The introduction of specific analytical scaling models for `EmptyWS` and `Sync` stalls (Section 4.3, Page 7), which are acknowledged weaknesses of coarse-grained interval models, represents a non-trivial delta over prior work. Similarly, the calibration of the representative warp selection using baseline performance data (Figure 9, Page 7) is a novel feedback mechanism that improves upon the standard k-means clustering approach.
Weaknesses
My critique is focused on carefully circumscribing the novelty and questioning the robustness of some of the new models.
-
Conceptual Precedent in CPU Accounting: The high-level concept of top-down or fractional stall accounting is not entirely new. Methodologies for CPU performance analysis, such as Intel's Top-Down Microarchitecture Analysis Method, have long sought to decompose stalls into concurrent frontend, backend, and memory-bound categories. While the application to a massively parallel, multi-warp GPU scheduler is certainly novel and presents unique challenges that the authors have addressed, the paper would be stronger if it acknowledged this conceptual lineage. Its novelty is in the specific GPU implementation, not the general idea of concurrent stall attribution.
-
GCScaler's Novelty is Incremental, Not Foundational: The paper honestly states that GCScaler "leverages the existing GPU interval analysis techniques." The innovation here is in the synthesis of their novel GCStack with improved analytical models. This is a valid and valuable engineering contribution, but it is an incremental advancement of the state-of-the-art in analytical DSE, not a fundamentally new approach. The novelty rests entirely on the quality of its targeted refinements.
-
Simplicity of Analytical Models: The proposed analytical model for scaling `EmptyWS` and `Sync` stalls appears to be a linear combination of the scale factors for memory and compute stalls (Section 4.3, Page 7). This is a simple first-order model. It is likely effective for the workloads tested, but its theoretical robustness is questionable. Highly divergent workloads with complex barrier synchronization patterns may exhibit non-linear relationships between compute/memory latency and idle/sync time. The novelty of this model is tempered by its simplicity and potentially limited scope.
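For concreteness, a small numeric illustration of this first-order model, using made-up baseline stall counts and scale factors rather than values from the paper:

```python
# Numeric illustration of the first-order scaling model in question, with
# made-up baseline stall cycles and scale factors (not values from the paper).
base = {"MemStall": 600.0, "ComStall": 200.0, "EmptyWS": 100.0, "Sync": 50.0}

# Suppose interval analysis predicts memory stalls halve and compute stalls
# stay flat at the new design point.
mem_new, com_new = base["MemStall"] / 2.0, base["ComStall"]

scale = (mem_new + com_new) / (base["MemStall"] + base["ComStall"])  # 500/800 = 0.625
empty_new, sync_new = base["EmptyWS"] * scale, base["Sync"] * scale

print(f"scale factor: {scale:.3f}")
print(f"EmptyWS: {base['EmptyWS']:.0f} -> {empty_new:.1f}, Sync: {base['Sync']:.0f} -> {sync_new:.1f}")
# The model forces idle and sync stalls to track the sum of memory and compute
# stalls; a kernel where faster memory makes warps reach a barrier earlier and
# wait longer would move Sync in the opposite direction and be missed.
```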
Questions to Address In Rebuttal
-
Robustness of Lazy Structural Stall Classification: The lazy classification method relies on observing which functional units are unable to accept new instructions at the execute stage. In future, more complex GPU architectures with deeper pipelines or more sophisticated out-of-order dispatch mechanisms, could the causal link between an issue-stage structural stall and a full back-end resource become tenuous or delayed? How would your method disambiguate a stall caused by, for example, a full INT unit from a downstream dependency that has backed up the pipeline to the INT unit?
-
Limits of the Analytical Idle/Sync Scaling Model: Can the authors provide a theoretical justification for why a linear combination of memory and compute stall scaling factors is sufficient for modeling idle (`EmptyWS`) and synchronization (`Sync`) stalls? Please address a hypothetical case of a highly divergent kernel where reducing memory latency might disproportionately increase time spent waiting at a barrier, a non-linear effect. Have you identified any workloads where this model's accuracy degrades?
-
Hardware Implementation Feasibility: The proposed hardware implementation (Section 6.2, Page 12) requires counters at each warp scheduler. To perform the fine-grained fractional accounting, this logic must inspect the state of all warp slots and perform a division (or equivalent calculation) every cycle. While the SRAM overhead is convincingly low, what is the impact of this additional combinational logic on the scheduler's critical path and overall power consumption? A cycle-accurate scheduler is a highly timing-sensitive component, and adding logic to its core decision loop is a non-trivial design cost.
TrioSim: A Lightweight Simulator for Large-Scale DNN Workloads on Multi-GPU Systems
Abstract
Deep Neural Networks (DNNs) have become increasingly capable of performing tasks ranging from image recognition to content generation. The training and inference of DNNs heavily rely on GPUs, as GPUs’ massively parallel architecture delivers extremely ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper presents TrioSim, a lightweight, trace-driven simulator for predicting the performance of large-scale DNN training on multi-GPU systems. The core methodology involves extrapolating operator-level execution traces collected from a single GPU to model various parallel execution schemes (data, tensor, pipeline) on multi-GPU configurations. The simulator combines an existing analytical performance model for computation (Li's Model) with a high-level, flow-based network model for communication. The authors claim high accuracy (errors typically below 10%) and simulation speeds of a few seconds.
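As a rough illustration of what such an extrapolation could look like (the operator names, trace format, and bandwidth-only ring-allreduce cost term below are assumptions for this sketch, not TrioSim's actual models), consider the following Python example.

```python
# Rough sketch of extrapolating a single-GPU operator trace to a data-parallel
# estimate: per-operator compute times are reused for every replica, and each
# gradient bucket adds a bandwidth-only ring-allreduce term. The trace contents,
# bucket sizes, and cost formula are illustrative assumptions, not TrioSim's.

# (operator name, measured backward time on one GPU in ms, gradient bytes)
single_gpu_trace = [
    ("embedding_bwd", 1.8, 256e6),
    ("attention_bwd", 3.5, 128e6),
    ("mlp_bwd",       2.9, 512e6),
]

def ring_allreduce_ms(nbytes, n_gpus, link_gb_per_s):
    """Classic bandwidth term: each GPU moves about 2*(p-1)/p of the buffer."""
    moved = 2.0 * (n_gpus - 1) / n_gpus * nbytes
    return moved / (link_gb_per_s * 1e9) * 1e3

def data_parallel_backward_ms(trace, n_gpus, link_gb_per_s, overlap=True):
    compute = sum(t for _, t, _ in trace)                       # same on each replica
    comm = sum(ring_allreduce_ms(b, n_gpus, link_gb_per_s) for _, _, b in trace)
    # Perfect overlap hides communication behind compute; no overlap serializes it.
    return max(compute, comm) if overlap else compute + comm

for gpus in (2, 4, 8):
    est = data_parallel_backward_ms(single_gpu_trace, gpus, link_gb_per_s=50.0)
    print(f"{gpus} GPUs: ~{est:.2f} ms per backward step (perfect-overlap assumption)")
```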
While the goal of creating a fast and accessible multi-GPU simulator is commendable, the methodology rests on several simplifying assumptions whose impacts are not fully evaluated. The validation contains several concerning artifacts where high-error results are dismissed, and the core extrapolation premise appears fragile and potentially unable to capture critical second-order effects of distributed systems.
Strengths
- Problem Significance: The paper addresses a critical and difficult problem. The cost and complexity of experimenting on large-scale GPU clusters make fast, accurate simulators highly valuable.
- Usability Premise: The core design choice to require only a single-GPU trace as input dramatically lowers the barrier to entry for researchers without access to large hardware testbeds. This is a significant practical advantage if the methodology proves to be sound.
- Breadth of Validation: The authors have made a commendable effort to validate TrioSim against multiple hardware generations (NVIDIA A40, A100, H100), network interconnects (PCIe, NVLink), and all three major parallelism strategies.
Weaknesses
-
Unjustified Dismissal of Mismatched Results: The treatment of pipeline parallelism results is methodologically unsound. The authors dismiss high-error results in Figure 10 (page 10) as "abnormal" and attribute them to "CPU scheduling overhead." This is problematic for two reasons. First, a simulator's purpose is to model real-world behavior, not just idealized theoretical expectations; if the hardware performance deviates from theory, the simulator should capture this deviation. Second, the authors explicitly state in their limitations (Section 8.2, page 13) that they do not model CPU overhead ("TrioSim only considers GPU compute and data movement... Estimating CPU overhead... requires dedicated future research."). This is a direct contradiction. The authors cannot use an unmodeled effect to selectively explain away the model's largest failures. This practice undermines the credibility of the other accuracy claims.
-
Fragility of the Single-Trace Extrapolation Premise: The fundamental premise—that a single-GPU trace contains sufficient information to model a distributed multi-GPU execution—is a strong and potentially fragile assumption. A single-GPU execution environment is sterile; it lacks system-level contention for shared resources like the PCIe bus, memory controllers, and CPU-GPU synchronization pathways that emerge only when multiple devices are active. For example, in a data parallel setup, multiple GPUs will concurrently launch kernels and perform DMA transfers, creating contention that is entirely absent in the single-GPU trace. The paper provides no evidence that its model can account for these critical emergent behaviors.
-
Oversimplification of Network and Communication Modeling: The network model is described as a "high-level flow-based network model" using Little's Law (Section 4.5, page 7), which "ignores protocol details." This is a significant simplification for modeling high-performance interconnects and libraries like NCCL. NCCL operations such as ring-AllReduce are complex, multi-stage processes involving careful buffer management and synchronization. It is not clear how a simple flow-based model can accurately capture the performance nuances of these algorithms, especially under contention. For instance, the performance of DDP relies on the fine-grained overlap of backward pass computation and communication. The paper states that it models this (Section 6, page 8) but fails to explain how the trace extrapolator and network model coordinate to simulate this overlap accurately.
-
Ambiguity in Core Technical Contributions: The paper's description of its own technical novelties is vague. In Section 4.4 (page 7), the authors state they "add a significant extension to Li's Model by supporting DNN training." It is unclear what this extension entails beyond simply applying the existing model to training-related operators. If the core computation model is largely adopted from prior work, this should be stated transparently. Without a clear description of the alleged extension, the contribution is difficult to assess.
Questions to Address In Rebuttal
- Please provide a clear justification for the exclusion of the high-error pipeline parallelism data points marked with orange triangles in Figure 10. You must reconcile the use of "CPU overhead" as an explanation for errors in Section 6 with the explicit statement in Section 8.2 that CPU overhead is not modeled and is left for future work.
- The paper's core premise relies on single-GPU trace extrapolation. How does the model account for system-level contention effects (e.g., on the PCIe bus, host memory controller) that are absent in a single-GPU trace but are known to impact performance in a real multi-GPU system?
- Provide a more detailed explanation of how the flow-based network model simulates the multi-stage process of a ring-AllReduce operation. Specifically, how is contention handled when multiple data streams compete for link bandwidth, and how is the synchronization between steps of the collective modeled?
- Clarify the exact nature of the "significant extension" made to Li's Model [34] for supporting DNN training. What specific algorithmic or modeling modifications were made beyond applying the existing model to operators found in a training workload?
- In the cross-GPU validation (Figure 11, page 10), traces from A40/A100 are used to predict H100 performance. Li's Model relies on hardware parameters. How are these parameters obtained or estimated for the target (H100) architecture without profiling it, and what is the sensitivity of the model's accuracy to these parameters?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents TrioSim, a lightweight, trace-driven simulator for large-scale DNN training workloads on multi-GPU systems. The central problem it addresses is the prohibitive cost and inaccessibility of large multi-GPU platforms for research, coupled with the infeasible slowness of traditional cycle-accurate simulators for this domain.
The work's core contribution is a methodology that requires only a single-GPU execution trace to simulate complex multi-GPU configurations. This is achieved through a novel "trace extrapolator" which programmatically generates the necessary computation and communication tasks for various parallelism schemes (data, tensor, pipeline) from this single-GPU baseline. TrioSim combines this extrapolator with an analytical performance model for GPU operators (extending Li's Model [34]) and a high-level, flow-based network model. The authors validate TrioSim against a variety of modern hardware (NVIDIA A40, A100, H100) and demonstrate low prediction error rates and simulation times on the order of seconds.
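For concreteness, the flavor of this extrapolation for the data-parallel case can be pictured as replicating the single-GPU operator trace on every rank and inserting an AllReduce whenever a gradient bucket fills. The sketch below uses hypothetical record fields and a hypothetical bucket size; it is not TrioSim's actual trace schema.

```python
# Purely illustrative sketch of single-trace extrapolation for data
# parallelism: replicate one GPU's operator trace per rank and insert an
# AllReduce task whenever a gradient bucket fills. Field names and the
# bucket size are assumptions, not TrioSim's trace schema.
from dataclasses import dataclass
from typing import List

@dataclass
class Op:
    name: str
    dur_us: float        # measured single-GPU duration
    grad_bytes: int = 0  # gradient bytes produced by a backward op, else 0

def extrapolate_dp(trace: List[Op], ranks: int,
                   bucket_bytes: int = 25_000_000) -> List[List[Op]]:
    per_rank = []
    for _ in range(ranks):
        tasks, pending = [], 0
        for op in trace:
            tasks.append(op)
            pending += op.grad_bytes
            if pending >= bucket_bytes:
                tasks.append(Op("allreduce", dur_us=0.0, grad_bytes=pending))
                pending = 0
        if pending:
            tasks.append(Op("allreduce", dur_us=0.0, grad_bytes=pending))
        per_rank.append(tasks)
    return per_rank

trace = [Op("fwd_linear", 80.0), Op("bwd_linear", 160.0, grad_bytes=30_000_000)]
print(len(extrapolate_dp(trace, ranks=4)[0]))   # 3 tasks: fwd, bwd, allreduce
```

The interesting part, which cannot be seen in a toy like this, is how the durations of the inserted communication tasks are resolved at simulation time by the network model.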
Strengths
This work makes a significant and timely contribution to the field of ML systems research. Its primary strengths lie in its conceptual framing and potential impact on the research community.
-
Democratization of Large-Scale Systems Research: The single most important strength of this paper is that it fundamentally lowers the barrier to entry for research into large-scale distributed training. The current state of the art requires either direct access to massive GPU clusters (the purview of large tech companies and a few well-funded labs) or relies on simulators like AstraSim and vTrain that require multi-GPU traces to begin with—a classic catch-22. By breaking this dependency and requiring only a single GPU for trace collection, TrioSim enables a much broader community of researchers to meaningfully explore the design space of future hardware and software systems.
-
Excellent Flexibility and Design Space Exploration: The single-trace methodology is not just about accessibility; it is a powerful tool for exploration. Because the multi-GPU behavior is extrapolated rather than profiled, researchers can programmatically vary key system parameters without needing to collect new traces. This includes network topology, parallelism strategy, batch size, and even cross-generational GPU performance prediction (as demonstrated in the validation on P3 in Section 6, page 9). The case studies in Section 7 (page 10), particularly the modeling of a novel photonic interconnect and the complex Hop protocol, are compelling demonstrations of this flexibility. TrioSim is not just a validation tool; it is a generative one for exploring "what-if" scenarios.
-
Pragmatic and Effective Hybrid Methodology: The authors have found a "sweet spot" in the simulation-versus-modeling spectrum. Instead of relying solely on a static analytical model, they embed their performance predictors within an event-driven simulation framework (Akita). This allows TrioSim to capture dynamic, system-level effects like network contention (as described in the network model in Section 4.5, page 7) that are often abstracted away in purely analytical approaches. This hybrid design is key to its accuracy, balancing abstract modeling for speed with dynamic simulation for capturing crucial interactions.
-
Connecting to a Broader Research Arc: This work fits perfectly within the ongoing evolution of performance modeling for computing systems. It represents a clear step forward in the lineage of trace-driven simulation, directly addressing the key bottleneck of its predecessors. By building upon established work like Li's Model [34] for computation and flow-based network simulation concepts, the paper smartly integrates existing, validated components into a more powerful and accessible whole.
Weaknesses
The weaknesses of the paper are largely inherent to its high-level, lightweight approach. They are acceptable trade-offs for the speed and flexibility gained, but they define the boundaries of the tool's applicability.
-
Abstraction of System-Level Overheads: As acknowledged in the discussion (Section 8.2, page 13), TrioSim's model is most accurate when the workload is heavily GPU-bound. It largely abstracts away CPU overhead, OS scheduler effects, and complex memory hierarchy interactions. For the target domain of large-scale DNN training, this is a reasonable assumption. However, for smaller models, smaller batch sizes, or workloads with significant data preprocessing or control logic on the CPU, the model's accuracy will likely degrade. The results for pipeline parallelism with many small micro-batches (Figure 10, page 10) hint at this, where external overheads begin to dominate and defy the expected scaling.
-
Fidelity of the "Magic" Extrapolator: The core innovation—trace extrapolation—is also a potential source of systemic error. The process of converting a single-GPU trace into a parallel one involves assumptions about how communication and computation will be scheduled. For example, the automatic division of layers for pipeline parallelism may not reflect the kind of careful, performance-aware manual partitioning an expert would perform. While the results are impressive, the extrapolation might miss subtle inter-dependencies or system-specific optimizations that occur in real multi-GPU execution.
-
Inherited Limitations of Component Models: The simulator's accuracy is fundamentally coupled to the accuracy of its sub-models. The reliance on Li's Model for operator performance means that any inaccuracies or unmodeled hardware features (e.g., structured sparsity support in new Tensor Cores) in that model will be inherited by TrioSim. This is not a flaw in TrioSim's design per se, but it positions the work as part of a larger ecosystem where progress in one area depends on progress in others.
Questions to Address In Rebuttal
-
Boundary Conditions of the Model: Could the authors elaborate on the failure modes of the trace extrapolator? For instance, how would TrioSim handle workloads that are not Bulk Synchronous Parallel (BSP) in nature? What about highly irregular or value-dependent communication patterns that cannot be easily inferred from a single-GPU trace? Defining the boundaries where the model's assumptions break down would strengthen the paper.
-
Extensibility for Future System Features: The paper demonstrates modeling a new network. How extensible is the framework to modeling other critical system aspects? For example, could TrioSim be extended to model memory capacity constraints, forcing it to simulate solutions like ZeRO-Offload? Or could it model the power/thermal implications of different scheduling decisions? Understanding the ease of adding new resource models would be valuable.
-
Handling of Collective Communications: The paper mentions that TrioSim recreates NCCL-style collectives. How does the extrapolator decide which specific algorithm to use (e.g., Ring, Double Binary Tree)? Since the optimal choice can be topology- and size-dependent, is this user-configurable, or does the simulator make a default assumption? A bit more detail on this part of the extrapolation process would be insightful.
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper presents TrioSim, a lightweight, trace-driven simulator designed to predict the performance of large-scale DNN training workloads on multi-GPU systems. The central thesis and primary novel claim of the work is the ability to extrapolate complex, multi-GPU training scenarios—including data, tensor, and pipeline parallelism—from an operator-level execution trace collected on only a single GPU. This is achieved through a "trace extrapolator" that programmatically inserts communication and synchronization operations into a single-GPU compute trace, which is then simulated using an event-driven engine, a regression-based performance model extended from prior work (Li's Model [34]), and a high-level flow-based network model. The authors claim this approach provides high flexibility and speed while maintaining reasonable accuracy, thereby lowering the hardware barrier for multi-GPU systems research.
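As background on the class of model being extended, a per-operator latency predictor of this kind is typically a regression from operator features (e.g., FLOPs and bytes moved) to measured kernel time. The sketch below fits such a regression on synthetic, made-up sample points purely for illustration; it has no connection to Li's Model's actual features, coefficients, or measurements.

```python
# Generic least-squares sketch of a feature-based operator latency model.
# The feature set and the sample points are synthetic and illustrative;
# they are not Li's Model's features, coefficients, or measurements.
import numpy as np

# Columns: FLOPs, bytes moved, constant term; target: kernel time in us.
X = np.array([[1e9, 4e8, 1.0],
              [2e9, 6e8, 1.0],
              [4e9, 1.6e9, 1.0],
              [8e9, 2.0e9, 1.0]])
y = np.array([60.0, 110.0, 230.0, 420.0])   # made-up "measurements"

coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict_us(flops: float, bytes_moved: float) -> float:
    return float(np.array([flops, bytes_moved, 1.0]) @ coef)

print(f"{predict_us(3e9, 1.2e9):.1f} us")
```

The open question the review raises is what, beyond fitting such a predictor to training-related operators, the claimed extension to DNN training actually adds.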
Strengths
The primary strength of this paper is its core methodological novelty, which directly addresses a significant practical limitation in prior art.
-
Novel Methodology ("Single-Trace Extrapolation"): The core idea of synthesizing a multi-GPU simulation from a single-GPU compute trace is a genuinely novel and valuable contribution to the field of DNN performance modeling. Prior trace-based simulators like AstraSim [60, 74], DistSim [41], and vTrain [3] all require traces collected from multi-GPU or multi-node systems, fundamentally tethering the simulation's scope to the availability of large-scale hardware. TrioSim's "trace extrapolator" (Section 4.3, page 5) breaks this dependency. This represents a significant delta over the state-of-the-art by shifting the problem from observing multi-GPU behavior to synthesizing it from a more accessible baseline.
-
Enabling Unconstrained Design Space Exploration: A direct consequence of the primary novelty is the simulator's flexibility. By not relying on a multi-GPU trace, users can freely modify network topologies (including asymmetrical ones, as noted in the Abstract), parallelism strategies, and batch sizes without the prohibitive cost of re-profiling on physical hardware for every configuration. This is a clear advantage over a tool like vTrain [3], which, as the authors correctly note, requires the parallelism settings to remain identical between profiling and simulation.
-
Clever Integration and Extension of Prior Art: The authors do not attempt to reinvent every component. They build upon Li's Model [34] for operator performance prediction. The novelty here is twofold: (1) extending an inference-focused model to the domain of DNN training, and (2) integrating this analytical model into a full, event-driven system simulator. This synthesis is non-trivial and represents a sound engineering and research contribution.
Weaknesses
The weaknesses of the paper are intrinsically linked to the abstractions required to enable its core novelty.
-
Novelty is Concentrated in the Extrapolator: While the overall system is effective, the novelty is highly concentrated in the "trace extrapolator" component. The other constituent parts are based on established concepts: the event-driven simulation paradigm (Akita Engine [67]), high-level flow-based network models [12, 16], and an operator performance model that is an extension of prior work [34]. The paper does not propose a fundamentally new way to model GPU compute performance or network physics; rather, it proposes a new way to compose these models for a new purpose. This is a systems contribution, not a component-level one.
-
Re-creation vs. Observation of Communication: The single-trace approach necessitates that collective communication patterns (e.g., ring-based AllReduce) are programmatically re-created by the extrapolator rather than being observed from a real system trace. As noted in Section 8.4, this means TrioSim models an idealized, open-sourced NCCL implementation. This abstraction may miss subtle but important performance effects from vendor-specific library optimizations, hardware-level topology awareness, or contention within the NIC/PCIe subsystem that a real multi-GPU trace would capture implicitly. The novelty of avoiding a multi-GPU trace comes at the cost of fidelity in modeling the communication subsystem.
-
Implicit Assumption of Component Composability: The methodology fundamentally assumes that the performance of a DNN operator (the compute part) and the performance of the network (the communication part) can be modeled separately and composed together without significant interaction effects. It also assumes an operator's core performance characteristics do not fundamentally change when its inputs are sharded for tensor or data parallelism. While the results suggest this is a reasonable approximation, this assumption is a potential source of error and a limitation of the proposed novel methodology, particularly for future architectures with tighter compute/communication integration.
Questions to Address In Rebuttal
-
The core novelty lies in the trace extrapolator's ability to inject communication tasks. How does the simulator ensure that its programmatic re-creation of collective communication algorithms faithfully models the complex, often hardware-specific implementations in libraries like NVIDIA's NCCL? For example, NCCL can dynamically select different algorithms (e.g., ring vs. tree) based on message size and topology. Does TrioSim model this decision-making, or does it assume a single, fixed algorithm like the ring-based one mentioned in Section 4.3?
-
The "single-trace extrapolation" approach is powerful but is predicated on phenomena observable from a single device. How would this methodology extend to model future multi-GPU systems with hardware features that only manifest during inter-GPU interaction? For example, technologies like in-network computing (e.g., NVIDIA SHARP) or direct, switch-arbitrated memory access between GPU pairs cannot be observed from a single-GPU trace. Is the proposed novel framework extensible to such cases, or does it represent a new boundary for this simulation methodology?
-
The paper states it extends Li's Model [34] to support DNN training (Section 4.4, page 7). Could the authors please elaborate on the specific novel contributions required for this extension? Were the primary changes adapting the model's feature set to include back-propagation and gradient-update operators, or were more fundamental changes required to the regression methodology itself to account for the different execution characteristics of training workloads?
HYTE: Flexible Tiling for Sparse Accelerators via Hybrid Static-Dynamic Approaches
Abstract
Specialized hardware accelerators are widely used for sparse tensor computations. For very large tensors that do not fit in on-chip buffers, tiling is a promising solution to improve data reuse on these sparse accelerators. Nevertheless, existing tiling ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present HYTE, a hybrid static-dynamic framework for tiling sparse tensor computations on hardware accelerators. The central idea is to use a static offline scheduler, which leverages lightweight sampling, to determine an initial "near-optimal" tiling configuration (tile size, shape, inter-tile order, buffer allocation). This static configuration is then used to guide a runtime system that employs a simple dynamic tuning mechanism to adjust tile shapes on-the-fly, ostensibly to correct for estimation errors and adapt to local data sparsity variations. The paper claims significant performance improvements (3.3x-6.2x) over existing state-of-the-art tiling strategies.
Strengths
- Motivation: The work correctly identifies a key challenge in sparse acceleration: the performance of a given tiling strategy is highly dependent on the specific sparsity pattern of the input tensors. The motivational analysis in Section 3 (Figure 1), while simple, effectively demonstrates that a one-size-fits-all approach to tile size, shape, and inter-tile order is suboptimal.
- Comprehensive Baseline: The authors include "Static-Opt" as a baseline, which represents the best possible static tiling scheme found via an exhaustive search on their simulator. This is a commendable choice, as it provides a solid upper bound for any static approach and helps to contextualize the performance of the proposed scheduler.
Weaknesses
-
Conflicting Narrative and Unsubstantiated "Near-Optimal" Claim: The paper's core premise is internally inconsistent. It champions the static scheduler as being able to "identify a near-optimal initial tiling scheme" (Abstract, page 1), yet it is built upon sampling techniques that are admitted to have errors of up to 43% (Section 5.1, page 6). An estimation error of this magnitude fundamentally undermines the claim of finding a "near-optimal" solution. This contradiction is further highlighted by the results in Figure 6, where dynamic tuning is required to outperform the optimal static baseline for several matrices. This suggests the static scheme is not a minor refinement away from optimal, but can be significantly flawed, and the dynamic component is not a fine-tuning mechanism but a necessary corrective action. The paper cannot simultaneously claim the static scheduler is near-optimal while also showing the dynamic part is critical for overcoming its deficiencies.
-
Insufficiently Rigorous Dynamic Tuning Mechanism: The dynamic tuning mechanism described in Section 6.4 (page 9) is simplistic and its effectiveness is not rigorously justified. The use of four quadrant counters to estimate the hit rate of nine potential new tile shapes is a coarse heuristic. The decision to apply a change is based on an arbitrary 5% improvement threshold, with no sensitivity analysis or justification provided. Given that this mechanism is responsible for the performance gains on irregular matrices (and for beating Static-Opt), its design requires a much more thorough defense than what is presented. (A sketch of this style of heuristic appears after this list.)
-
Understated Complexity and Overheads: The paper downplays the complexity and cost of the proposed architecture.
- Hardware Complexity: The introduction of a global Tiling controller and per-tensor accessors (Figure 4) represents a non-trivial increase in control logic complexity compared to simpler tiling schemes.
- Metadata Overhead: The strategy for managing metadata for flexible tiling (Section 6.2, page 8) requires storing fiber segment positions in off-chip DRAM, which generates additional memory traffic. The cost model claims to account for this, but the paper provides no quantitative analysis of this overhead, which could be substantial for highly irregular and finely-tiled matrices.
- Scheduling Overhead: The authors characterize the offline scheduling time as "minor" (Section 8.4, page 13). However, an average of 16.2% of the total execution time is not negligible. For some matrices in Figure 14, the offline time appears comparable to the online execution time. This cost makes the proposed approach unsuitable for scenarios with dynamic inputs or where compilation latency is a concern.
-
Inadequate Validation of Core Estimation Method: The entire static scheduling framework rests on the accuracy of the effMAC and nnzCTk estimators (Algorithm 2). While the authors cite previous work, they fail to provide sufficient validation in the context of their accelerator cost model. A detailed analysis mapping the per-matrix estimation error to the resulting performance gap between HYTE's static schedule and the Static-Opt baseline is conspicuously absent. Without this analysis, the claim that the sampling is "effective" is unsubstantiated.
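On the dynamic tuning mechanism questioned above, the scale of logic at issue can be written in a few lines. The following is a hypothetical reconstruction from the paper's description (quadrant counters, nine candidate shapes, a 5% threshold), not HYTE's implemented logic, and the candidate-scoring proxy is deliberately crude.

```python
# Hypothetical reconstruction (not HYTE's implemented logic) of a
# quadrant-counter tile-shape heuristic: per-quadrant hit/access counters
# score nine candidate shapes (each dimension scaled by 1/2, 1, or 2),
# and the shape changes only if the best candidate clears a 5% threshold.
# The scoring proxy below is deliberately crude and purely illustrative.

def choose_tile_shape(q_hits, q_accs, cur_shape, threshold=0.05):
    """q_hits/q_accs: 2x2 per-quadrant hit and access counters."""
    cur_rate = sum(map(sum, q_hits)) / max(sum(map(sum, q_accs)), 1)
    best_shape, best_rate = cur_shape, cur_rate
    for sr in (0.5, 1, 2):            # row-dimension scale factor
        for sc in (0.5, 1, 2):        # column-dimension scale factor
            # Crude proxy: a shrunk dimension keeps only its first quadrant;
            # a grown one is assumed to behave like the current tile.
            rows = [0] if sr < 1 else [0, 1]
            cols = [0] if sc < 1 else [0, 1]
            hits = sum(q_hits[i][j] for i in rows for j in cols)
            accs = max(sum(q_accs[i][j] for i in rows for j in cols), 1)
            if hits / accs > best_rate:
                best_rate = hits / accs
                best_shape = (int(cur_shape[0] * sr), int(cur_shape[1] * sc))
    # Switch only if the estimated gain exceeds the (relative) 5% threshold.
    return best_shape if best_rate > cur_rate * (1 + threshold) else cur_shape

print(choose_tile_shape([[40, 5], [35, 4]], [[50, 40], [45, 38]], (256, 256)))
```

Even in this toy form, the sensitivity concern is visible: the outcome depends on both the proxy used to score candidates and the chosen threshold value.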
Questions to Address In Rebuttal
- Please resolve the apparent contradiction in the paper's core claims. Is the primary contribution the static scheduler that finds a robust, near-optimal solution, or is it the dynamic tuning that corrects for an often-inaccurate static prediction? How can a schedule based on estimations with up to 43% error be considered "near-optimal"?
- The dynamic tuning mechanism uses a 5% hit-rate improvement threshold to trigger a tile shape change. Please provide a sensitivity analysis for this parameter. Justify why 5% is the right value and explain how performance degrades with other values.
- For irregular matrices like mouse_gene, HYTE outperforms Static-Opt. The paper attributes this to dynamic tuning's ability to adapt to local sparsity. This implies that any single static tile shape is inherently suboptimal for such matrices. Does this finding not invalidate the fundamental goal of the offline scheduler, which is to find a single, fixed initial scheme?
- Regarding the metadata management in Section 6.2, please provide a quantitative analysis of the DRAM traffic generated solely by metadata for the most challenging matrices (e.g., kron_g500-logn18 or dielFilterV2real). How does this overhead compare to the tensor data traffic, and can you demonstrate that it does not become a system bottleneck?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents HYTE, a hybrid static-dynamic framework for managing flexible tiling on sparse tensor accelerators. The core contribution lies in its pragmatic division of labor: a powerful offline static scheduler explores a comprehensive design space—including tile size, shape, inter-tile iteration order, and buffer allocation—to find a near-optimal initial tiling scheme. This is made possible by lightweight sampling techniques that estimate key performance metrics. This initial scheme is then executed on hardware that supports a lightweight dynamic tuning mechanism to correct for local data variations and efficiently manage the necessary tiling metadata. By decoupling the complex, global search (done statically) from the simple, local adjustments (done dynamically), HYTE aims to achieve the adaptability of dynamic approaches without their high complexity, and the low overhead of static approaches without their rigidity. The authors demonstrate significant performance improvements over state-of-the-art sparse tiling accelerators across a diverse set of benchmarks.
Strengths
-
Holistic and Principled Problem Formulation: The primary strength of this work is its holistic view of the sparse tiling problem. The authors rightly argue in Section 3 (page 3) that prior work has focused on a narrow subset of tiling parameters. HYTE, by contrast, co-optimizes tile size, shape, inter-tile order, and buffer allocation. The motivation presented in Figure 1 is particularly compelling, clearly illustrating that the optimal choice for each of these parameters is highly data-dependent. This comprehensive approach moves the field beyond point solutions and towards a more fundamental framework for tackling the problem.
-
Elegant Hybrid Architecture: The static-dynamic partitioning is the key insight that makes this holistic optimization tractable. Purely static schemes are brittle, while purely dynamic schemes are often too complex to support a rich design space. HYTE finds a "sweet spot" by using the static phase for the heavy lifting of global optimization and relegating the dynamic phase to simple, local course-correction. This makes the runtime hardware relatively simple (e.g., the counters and simple rules described in Section 6.4, page 9) while still reaping the benefits of adaptivity. This is a mature and pragmatic engineering approach.
-
Connecting to Broader CS Principles: The paper successfully integrates established techniques from other domains into the architecture community. The use of sampling-based estimation for effMAC and output tensor size (nnzCTk) in Section 5.1 (page 6) is a direct and effective application of principles from the database query optimization and data streaming literature (e.g., cardinality estimation). By bringing these powerful estimation techniques into a hardware scheduler, the authors provide a lightweight yet effective way to navigate the enormous search space without resorting to full simulation or expensive preprocessing. (A generic sketch of this class of estimator appears after this list.)
-
Thorough and Convincing Evaluation: The experimental evaluation is robust. The authors compare HYTE against the three most relevant recent works (Tailors, DRT, HARP) and demonstrate substantial speedups (Figure 6, page 10). The inclusion of a brute-force "Static-Opt" baseline is excellent, as it shows that their scheduling algorithm is not just better than prior art, but also remarkably close to a statically-derived optimal solution. The breakdown in Figure 8 (page 11), which incrementally adds HYTE's features, effectively attributes the performance gains to the specific design choices, strengthening their claims.
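To make the cardinality-estimation analogy above concrete: the generic form of such an estimator samples the contraction dimension and rescales, as sketched below. This is a textbook-style uniform-sampling estimate of the effective multiply count, not HYTE's exact ESTEFFMAC or ESTNNZCTK algorithms.

```python
# Generic uniform-sampling estimator for the effective multiply count of
# C = A @ B (both sparse). The exact value is sum_k nnz(A[:, k]) * nnz(B[k, :]);
# sampling a fraction of k indices and rescaling gives an unbiased estimate.
# This illustrates the class of estimator HYTE relies on, not its algorithm.
import random

def estimate_effmac(nnz_A_cols, nnz_B_rows, sample_frac=0.01, seed=0):
    """nnz_A_cols[k] = nonzeros in column k of A; nnz_B_rows[k] = nonzeros
    in row k of B. Both lists share the contraction dimension K."""
    K = len(nnz_A_cols)
    m = max(1, int(K * sample_frac))
    ks = random.Random(seed).sample(range(K), m)
    return sum(nnz_A_cols[k] * nnz_B_rows[k] for k in ks) * (K / m)
```

How the per-matrix error of this rescaling propagates through the cost model is ultimately what determines how close the static schedule lands to Static-Opt.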
Weaknesses
While the paper is strong, its focus on a specific class of problems opens up questions about its broader applicability, which could be strengthened by further discussion.
-
Coupling of Tiling and Intra-Tile Dataflow: The framework currently assumes the intra-tile dataflow (e.g., Gustavson's, IP, OP) is fixed. However, the optimal tiling scheme is deeply intertwined with the optimal dataflow. For example, a tiling choice that creates tall, skinny tiles might make an output-stationary dataflow highly efficient, while a different tile shape might favor an input-stationary one. Works like Spada [20] and Flexagon [24] have shown the benefits of dynamically selecting the dataflow. A truly holistic system would co-optimize tiling and dataflow together. While the authors acknowledge this as a potential extension, the current separation limits the scope of the "optimality" they can achieve.
-
Generalizability of the Scheduler: The scheduler's cost model and estimation techniques (ESTEFFMAC, ESTNNZCTK) are presented in the context of SpMSpM (a 3D loop nest). It is not immediately obvious how these lightweight sampling methods would generalize to more complex sparse tensor algebra expressions, such as the Matricized Tensor Times Khatri-Rao Product (MTTKRP) or other multi-tensor contractions. Adapting the scheduler to a broader set of primitives in a generalized tensor algebra compiler would likely require significant new estimation strategies.
-
Scalability of Metadata Management: The paper correctly identifies metadata management as a key challenge and proposes a coordinated data/metadata buffer management scheme (Section 6.3, page 8). While this is effective, the fundamental overhead of storing metadata pointers for highly flexible tiling schemes in off-chip memory (Section 6.2) could become a scalability bottleneck for petascale problems or extremely fine-grained tiling, potentially consuming a significant fraction of DRAM bandwidth. A brief analysis of this limitation would add valuable context.
Questions to Address In Rebuttal
-
Could the authors elaborate on how the HYTE scheduling framework might be extended to co-optimize the intra-tile dataflow alongside the tiling parameters? For instance, could the cost model be adapted to evaluate different dataflows for a given tile shape, and would this significantly increase the search complexity?
-
Regarding the generalizability of the sampling methods: For a more complex operation like C(i,j) = A(i,k,l) * B(k,l,j), how would the offline scheduler need to be modified? Would the existing sampling approach still be effective, or would it require fundamentally new estimators for the intermediate tensor products?
-
The offline scheduling time, while shown to be a reasonable fraction of the total execution time in Figure 14 (page 13), could still be a consideration for interactive workflows. Could you provide some absolute numbers for the scheduling time on the larger matrices and comment on how this time scales with matrix size and the desired sampling fraction?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present HYTE, a framework for tiling large sparse tensor computations on hardware accelerators. The core thesis is that existing approaches are either purely static (like Tailors [38]), leading to suboptimal performance on diverse sparsity patterns, or purely dynamic (like DRT [25]), incurring high hardware complexity and limited decision scope. HYTE proposes a hybrid solution: an offline static scheduler uses sampling to find a near-optimal initial tiling scheme, followed by a lightweight online dynamic tuner that makes minor adjustments at runtime. The key novelty claim lies in this specific hybrid formulation, where the static scheduler explores a significantly richer design space than prior work—including tile size, tile shape, inter-tile iteration order, and buffer allocation policies—and the dynamic component acts as a simple corrector rather than a primary tiling engine.
Strengths
The primary strength of this paper is its novel reframing of the sparse tiling problem. While the concepts of "static scheduling" and "dynamic tuning" are not new in isolation, their synthesis in this specific manner for sparse accelerators represents a genuine contribution.
-
Novel Hybrid Formulation: The core idea—using a comprehensive offline search to establish a strong baseline plan, which is then refined by a simple, low-cost online mechanism—is a clear departure from the state-of-the-art. Prior work has treated static and dynamic tiling as opposing philosophies. HYTE demonstrates they can be synergistic, with the static component handling global optimization and the dynamic component handling local variations. This is a conceptually new approach in this domain.
-
Expansion of the Tiling Design Space: The most significant novel contribution is the scope of the static scheduler. Previous static or dynamic approaches fixed many key parameters. For instance, Tailors [38] uses a fixed heuristic for tile shape expansion (k→j→i) and inter-tile order (j first). DRT [25] uses a greedy algorithm that results in cube-like tiles. In contrast, HYTE treats the tile shape, the inter-tile iteration order, and the buffer allocation among tensors as first-class, searchable parameters. Systematically exploring this multi-dimensional space (as shown in Figure 1, page 4) to find a data-specific configuration is a novel and powerful idea for sparse accelerators.
-
Lightweight Dynamic Correction: The dynamic tuning mechanism is novel in its simplicity and role. Unlike DRT [25], which requires complex micro-tiling and greedy online algorithms, HYTE's dynamic tuning is a simple correction mechanism based on quadrant counters. Its novelty lies not in its complexity, but in the insight that if the static plan is good enough, the dynamic component does not need to be powerful, thereby avoiding the high hardware costs of previous dynamic systems.
Weaknesses
While the overall framework is novel, some of its constituent components are built upon well-established prior art, and the paper could be more precise about where it is applying existing ideas versus inventing new ones.
-
Derivative Sampling Techniques: The use of sampling to estimate the non-zero count of sparse matrix products is not a new concept. The authors' method in Section 5.1 is functionally very similar to techniques described in prior work on sketching and graph analytics (e.g., Amossen et al. [2], Anh et al. [3]). The novelty here is the application of these known techniques to inform a hardware tiling scheduler and the extension to estimate nnzCTk (tiled access traffic). This is an engineering contribution, not a fundamental algorithmic one. The paper should more clearly position this as an application of known methods.
-
Heuristic-Driven Dynamic Tuning: The dynamic tuning mechanism, while novel in its simplicity, is based on heuristics (e.g., 2x scaling factor, 5% improvement threshold). There is little theoretical justification for these specific choices, and the work does not explore the sensitivity to these parameters or investigate alternative correction models. This makes the dynamic component feel more like an engineered trick than a principled mechanism, slightly weakening the novelty of that specific part.
-
Generalization to Multi-Level Hierarchies: The discussion of extending HYTE to multi-level buffers (Section 9, page 13) is speculative. The proposed "iterative" scheduling approach is an intuitive but unproven idea. As presented, the core novel contributions of HYTE have only been demonstrated and evaluated for a single-level global buffer, which limits the scope of its proven novelty.
Questions to Address In Rebuttal
-
The fundamental concept of a hybrid static-dynamic optimization approach is a classic pattern in computer systems (e.g., profile-guided optimization followed by JIT compilation). Could the authors more precisely articulate the delta between HYTE and prior sparse tiling systems like Tailors [38] and DRT [25]? Is the contribution truly a new paradigm, or is it a clever, albeit incremental, synthesis of the static analysis from the former and the runtime adaptability of the latter?
-
Regarding the estimation method in Section 5.1, the paper cites prior art [2, 3] for estimating the output size of SpGEMM. Is the contribution for estimating nnzCTk (the non-zero access traffic under tiling) a direct application of these known algorithms, or does it involve a novel modification to account for the effects of tiling that is not obvious from prior work?
-
The dynamic tuning mechanism is intentionally simple. Did the authors explore any alternative, perhaps more complex, runtime correction strategies? Please justify why the chosen heuristic (2x scaling, 5% threshold) is sufficient and robust enough to correct static estimation errors across the wide variety of sparsity patterns evaluated. Is there a risk that this simple mechanism is insufficient for matrices with highly irregular, non-local sparsity variations?
DX100: Programmable Data Access Accelerator for Indirection
Abstract
Indirect memory accesses frequently appear in applications where memory bandwidth is a critical bottleneck. Prior indirect memory access proposals, such as indirect prefetchers, runahead execution, fetchers, and decoupled access/execute architectures, ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper introduces DX100, a shared, programmable accelerator designed to improve performance for applications with indirect memory accesses. The core mechanism involves offloading bulk memory operations, which DX100 then reorders, coalesces, and interleaves to maximize DRAM bandwidth utilization by improving row-buffer hit rates and memory-level parallelism. The authors also present an MLIR-based compiler intended to automate the transformation of legacy code to use the accelerator. The paper claims substantial speedups of 2.6x over a multicore baseline and 2.0x over a state-of-the-art indirect prefetcher (DMP). However, a closer inspection reveals that these claims are predicated on a potentially biased evaluation methodology and several unsubstantiated assumptions regarding programmability, system integration, and scalability.
Strengths
- Problem Formulation: The paper correctly identifies a well-known and persistent bottleneck in high-performance computing and data-intensive applications: the poor DRAM bandwidth utilization caused by irregular, indirect memory accesses.
- Architectural Principle: The fundamental concept of using a large reordering window (up to 16K accesses) near the memory controller to improve DRAM locality is sound in principle. This is a logical extension of the reordering capabilities found in modern memory controllers, albeit at a much larger scale.
- Compiler Effort: The inclusion of an MLIR-based compiler toolchain is a commendable effort towards addressing the critical issue of programmability for specialized hardware. This is a necessary component for any practical accelerator proposal.
- Evaluation Infrastructure: The use of a cycle-accurate simulation framework combining Gem5 and Ramulator2 is appropriate for this type of architectural study.
Weaknesses
-
Unsupported Claims of "Automatic" Compilation: The paper claims the compiler can "automatically transform legacy code" (Abstract, page 1), but the evidence provided is insufficient. The "Legality" check described in Section 4.2 (page 8) relies on MLIR's alias analysis to ensure no data dependencies are violated. Alias analysis in the presence of pointers and array indexing, especially in C/C++, is a notoriously difficult problem and is often overly conservative. The paper provides no data on the compiler's efficacy on real-world, complex codes. It is unclear what percentage of indirect access loops in a large application would fail this legality check, thus preventing acceleration. The simple example in Figure 7 is not convincing proof of general applicability.
-
Potentially Unfair Comparison to State-of-the-Art: The 2.0x speedup claim over the DMP prefetcher is questionable. In Section 5 (page 8), the authors state, "Compared to the reported results [33], we observed less performance improvement for DMP with larger caches and more cores in our baseline configuration." This suggests that the baseline system used for comparison was not one where DMP performs optimally. Evaluating a state-of-the-art technique in a configuration that diminishes its effectiveness does not constitute a fair comparison. The burden of proof is on the authors to demonstrate that their solution is superior to a well-tuned competitor on the target system, not an untuned one.
-
Weak Coherence and Correctness Model: The proposed coherence mechanism is fragile. The paper asserts that correctness is maintained because "DX100 maintains exclusive write access to the indirect arrays within the ROI (Section 4.2 – Legality)" (page 7). This is not a hardware-enforced guarantee but a software convention imposed by the compiler. Any bug in the user's code or a limitation in the compiler's alias analysis that violates this assumption could lead to silent data corruption. Furthermore, the multi-DX100 coherence protocol described in Section 6.6 (page 11) is described as a "coarse-grained region-based coherence protocol" with no quantification of its performance overhead. The claim that it does not introduce "significant bottlenecks" is an unsubstantiated assertion.
-
Questionable Scalability: The scalability results in Figure 14 (page 12) are not as strong as implied. When moving from 4 to 8 cores with a single DX100 instance, the geometric mean speedup over baseline actually decreases slightly from 2.6x to 2.5x. This strongly suggests that the shared DX100 accelerator itself has become the central bottleneck, limiting system scalability. While adding a second DX100 instance improves performance to 2.7x, this merely sidesteps the analysis of the single-instance bottleneck. The paper lacks any analysis of internal contention within the DX100 (e.g., on scratchpad ports, controller logic) as the number of client cores increases.
-
Overly Optimistic Microbenchmark Design: The microbenchmark evaluation in Section 6.1 (page 9) appears to be constructed to showcase DX100 in the best possible light. The authors "generate a constant set of 64K unique B[i] indices to evenly distribute the indirect A[B[i]] words across 16 rows in all banks, bank groups, and channels." This synthetic, uniform distribution is an idealized scenario for reordering and interleaving. A more rigorous stress test would involve highly skewed or clustered access patterns that are common in real-world graphs and sparse data structures, which would likely challenge the effectiveness of the Row Table.
Questions to Address In Rebuttal
-
Compiler Efficacy: Provide quantitative data on the compiler's success rate. Across a suite of larger, more complex applications (beyond the evaluated kernels), what percentage of loops containing indirect accesses could be automatically and legally transformed by your toolchain? For those that failed, what were the primary reasons (e.g., conservative alias analysis, unsupported patterns)?
-
DMP Comparison Fairness: Justify the fairness of the DMP comparison. Please provide a sensitivity analysis of DMP's performance on your 4-core baseline system with different configuration parameters. Is it possible that a different set of tuning parameters for DMP would significantly close the 2.0x performance gap you report?
-
Coherence Overhead: Quantify the performance overhead of the coarse-grained coherence protocol used for the dual-DX100 configuration in Section 6.6. What is the breakdown of cycles spent on coherence messages, state verification, and region locking relative to the total execution time?
-
Shared Accelerator Bottleneck: Provide a detailed analysis of potential bottlenecks within a single DX100 instance as the core count scales to eight. What is the contention on the scratchpad ports, the instruction controller, and the request generation logic? How does this contention explain the performance plateau observed in Figure 14?
-
Workload Scope: The paper explicitly states in Section 4.1 (page 7) that DX100 does not support conventional pointer-chasing. Please clarify the scope of workloads DX100 can effectively accelerate. Given this limitation, how large is the class of important irregular applications (e.g., certain graph traversals, adaptive mesh refinement with pointer-based structures) that fall outside the capabilities of your proposed architecture?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces DX100, a programmable, shared data access accelerator designed to mitigate the memory bandwidth bottleneck caused by indirect memory accesses. The work correctly identifies that prior art, such as prefetchers and decoupled access/execute units, has primarily focused on hiding memory latency rather than improving the fundamental bandwidth utilization of the DRAM subsystem.
The core contribution is an architectural design that offloads bulk indirect accesses to a near-memory unit. This unit leverages its visibility into a large window of future accesses (a "tile" of up to 16K indices) to perform aggressive reordering, coalescing, and interleaving of DRAM requests. This strategy explicitly targets DRAM-level optimizations—improving row-buffer hit rates and maximizing channel/bank parallelism—that are impossible for conventional memory controllers with their limited request buffers. The authors support this architecture with a flexible ISA, a detailed microarchitectural design, and an MLIR-based compiler for automatic code transformation. The evaluation shows significant performance gains (2.6x over a multicore baseline, 2.0x over the state-of-the-art DMP indirect prefetcher) on a range of data-intensive applications.
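The intuition behind the bandwidth claim can be illustrated with a toy model: batch a tile of indirect word addresses, group them by DRAM row, and count row-buffer hits before and after. The sketch below assumes a single bank with an open-page policy and an 8 KB row; these are illustrative simplifications, not DX100's Row/Word Table microarchitecture.

```python
# Toy illustration of reordering a 16K-entry tile of indirect accesses by
# DRAM row before issue. Single bank, open-page policy, 8 KB rows: these
# are simplifying assumptions, not DX100's Row/Word Table microarchitecture.
import random
from collections import defaultdict

ROW_BYTES, WORD_BYTES = 8192, 8

def row_hits(addresses):
    """Count accesses whose row matches the immediately preceding access."""
    hits, last_row = 0, None
    for a in addresses:
        row = a // ROW_BYTES
        hits += (row == last_row)
        last_row = row
    return hits

def reorder_by_row(addresses):
    groups = defaultdict(list)
    for a in addresses:
        groups[a // ROW_BYTES].append(a)
    return [a for row in sorted(groups) for a in sorted(groups[row])]

random.seed(1)
indices = [random.randrange(1 << 20) for _ in range(16384)]   # one 16K tile
addrs = [i * WORD_BYTES for i in indices]
print(row_hits(addrs), "row hits as issued vs",
      row_hits(reorder_by_row(addrs)), "after row-major reordering")
```

The real design additionally interleaves across banks and channels and coalesces duplicate words, but even this toy shows where row-buffer hit-rate gains come from.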
Strengths
-
Excellent Problem Formulation and Positioning: The authors do a superb job of contextualizing their work. They clearly articulate why indirect accesses are a bandwidth problem, not just a latency problem (Section 2.1, page 2). By distinguishing their goals from traditional prefetching and runahead execution, they carve out a well-defined and important niche. This work understands that simply issuing random memory requests faster doesn't solve the underlying issue of poor DRAM command scheduling.
-
A Novel and Compelling Architectural Synthesis: DX100 is not merely an incremental improvement; it represents a thoughtful synthesis of several key architectural concepts. It combines the offloading principle of Decoupled Access/Execute (DAE), the reordering insights from software-managed techniques (like Milk [57]), and the near-memory placement of Processing-in-Memory (PIM) systems. However, unlike PIM, its goal isn't to compute in memory but to orchestrate data access for the host cores more intelligently. This is a powerful and pragmatic middle ground that feels architecturally plausible for future SoCs.
-
End-to-End System Design: A significant strength of this paper is its completeness. The authors present more than just a conceptual idea; they provide a concrete microarchitecture (Section 3, page 3), a general-purpose ISA to ensure programmability (Table 2, page 4), and a modern compiler toolchain to address the critical usability question (Section 4.2, page 7). This holistic approach, from hardware to compiler, makes the proposal far more convincing and demonstrates a deep understanding of what is required to make an accelerator practical.
-
Rigorous and Insightful Evaluation: The experimental methodology is strong. The use of microbenchmarks (Section 6.1, page 8) to isolate the benefits of instruction reduction versus memory bandwidth improvement is very effective. The main evaluation across 12 diverse workloads demonstrates broad applicability. The direct comparison against a state-of-the-art academic prefetcher (DMP) is precisely the right point of comparison and convincingly demonstrates the superiority of the bandwidth-oriented approach. The detailed breakdown of performance gains into improved row-buffer hit rate and request buffer occupancy (Figure 10, page 10) provides clear evidence that the proposed mechanisms are working as intended.
Weaknesses
While this is a strong paper, there are areas where the discussion could be deepened to better situate the work's trade-offs within the broader systems landscape.
-
Implications of a Shared Accelerator Model: The choice to make DX100 a shared resource is well-justified for area efficiency, but the paper could further explore the performance implications of contention. The scalability discussion (Section 6.6, page 11) addresses scaling the number of DX100 instances, but what happens when multiple cores contend for a single DX100? If one core offloads a large, easily-reordered workload while another offloads a smaller, more chaotic one, how does the internal scheduling manage this? The potential for inter-thread interference, priority inversion, or underutilization of the large reordering window due to fragmented requests from different cores seems like a rich area for analysis.
-
Programmability and Compiler Robustness: The inclusion of an MLIR-based compiler is a fantastic step towards usability. However, the paper acknowledges its limitations (e.g., inability to handle pointer-chasing) and the need for a manual API as a fallback. A deeper discussion on the "programmability cliff" would be beneficial. What percentage of loops in a larger HPC application suite like DOE proxy apps could be automatically compiled? How complex does the address calculation or control flow have to be before the compiler gives up? Understanding the boundary between automatic and manual transformation is key to assessing the true "cost of adoption" for this architecture.
-
System-Level Energy Efficiency Trade-offs: The paper provides an area and power analysis (Section 6.5, page 11), which is commendable. The 2MB scratchpad is a significant component, and the authors fairly compare its area against an equivalent amount of LLC. However, the analysis could be strengthened by moving from a component-level power number to a system-level energy efficiency metric (e.g., Energy-Delay Product). The performance gains are substantial, but does this come at a proportional energy cost, or does the system become genuinely more energy-efficient? Showing that the reduction in core instruction execution and DRAM stalls leads to a net system energy saving would be a very powerful result.
Questions to Address In Rebuttal
-
Regarding the shared accelerator design: Could the authors elaborate on how contention is managed within a single DX100 instance when multiple cores issue simultaneous offload requests? Is there a fairness mechanism, and have you studied the performance impact of interference between different threads' access patterns on the reordering efficiency?
-
Regarding the compiler: Can you provide more insight into the practical limitations of the current automatic compiler pass? For the benchmarks evaluated, were all key kernels successfully transformed automatically, or did some require manual use of the DX100 APIs? This would help gauge the maturity and real-world applicability of the compilation flow.
-
Regarding energy: While the power breakdown is useful, could you provide an estimate of the total system energy consumption for a given benchmark workload compared to the baseline and DMP configurations? This would clarify whether the impressive 2.6x speedup also translates into a more energy-efficient solution overall.
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper introduces DX100, a programmable, shared data access accelerator designed to optimize indirect memory accesses. The authors' central claim of novelty rests on the architectural synthesis of several concepts: offloading bulk indirect accesses to a dedicated unit near the memory controllers, and leveraging the resulting large visibility window (up to 16K indices) to perform aggressive reordering, coalescing, and interleaving of DRAM requests. This approach aims to directly improve DRAM bandwidth utilization by increasing row-buffer hit rates, in contrast to prior art that has largely focused on latency hiding via prefetching. The accelerator is supported by a custom ISA and an MLIR-based compiler flow for automatic code transformation.
While many of the constituent ideas—decoupled access, memory reordering, near-memory acceleration—have been explored individually, the core novelty of this work lies in their specific combination to create a new architectural entity. The key differentiator is the use of a large, explicit batch of indices to fundamentally restructure memory access patterns for bandwidth, rather than predicting future accesses to improve latency.
Strengths
The primary strength of this paper is its novel architectural approach to a well-known problem.
-
A Shift in Problem Framing: The most significant novel contribution is the shift in focus from latency hiding to direct bandwidth optimization for irregular accesses. While prefetchers (e.g., DMP [33]) attempt to bring data into the cache to hide latency, they do little to alleviate the underlying chaotic access patterns hitting the DRAM. DX100 tackles the root cause by reordering a large batch of requests before they are issued, a fundamentally different and novel strategy.
-
Architectural Synthesis: The paper proposes a new point in the design space. It is not merely a prefetcher, nor is it a conventional DAE fetcher unit, nor is it a full-fledged near-memory processing engine. By creating a shared accelerator that sits between the cores' cache hierarchy and the memory controllers (Section 3, Figure 2), the authors have designed a system that bypasses core-side structural hazards (ROB, LSQ) and has a much larger visibility window than a DRAM controller. This specific architectural placement and function is, to my knowledge, novel.
-
Mechanism for Large-Scale Reordering: The Indirect Access Unit's design, particularly the Row Table and Word Table structures (Section 3.2, page 4-5), is a novel hardware mechanism designed explicitly to exploit the large (16K) index window. This is not a simple reorder buffer; it is a purpose-built data structure for identifying row locality and coalescing opportunities across a massive number of pending requests, which is a direct consequence of the novel programming model.
Weaknesses
The paper's claims of novelty could be challenged on the grounds of integration versus fundamental invention, and the comparisons to prior art could be more comprehensive.
-
Incrementalism vs. Invention: An argument could be made that DX100 is an extremely sophisticated and well-engineered integration of existing concepts rather than a fundamentally new invention. Decoupled Access/Execute (DAE) architectures [115] have long proposed separating memory access streams. Memory reordering is the primary job of a DRAM controller. The novelty lies purely in the scale and location of these operations. The authors should be careful to frame their contribution as a novel synthesis that enables a new capability, rather than implying the invention of reordering itself.
-
Insufficient Comparison to DAE/Fetcher Units: The paper's primary quantitative comparison is against a baseline and an indirect prefetcher (DMP [33]). While this is a good and relevant comparison, the architecture has strong conceptual overlap with modern fetcher units like Terminus [65] and SpZip [130]. The Related Work section (Section 7, page 12) acknowledges these but dismisses them qualitatively by claiming they "provide insufficient visibility" or are tightly coupled. However, a more detailed architectural and programming model comparison is warranted. These fetchers also offload access streams and can handle complex patterns. The key delta appears to be DX100's explicit focus on bulk reordering for DRAM bandwidth, but this distinction needs to be more sharply drawn against these closely related works.
-
Novelty of the Programming Model: The offloading model relies on identifying and hoisting bulk operations. This is conceptually similar to how GPUs operate or how vector instructions are generated. While the MLIR-based implementation is a solid piece of engineering, the underlying programming abstraction (offloading a bulk operation) is not in itself a new concept. The novelty is therefore constrained to the accelerator architecture that executes this abstraction.
Questions to Address In Rebuttal
-
Defense of Synthesis as Novelty: The paper effectively combines a DAE model, near-memory placement, and large-scale DRAM scheduling. Can the authors further articulate why this specific synthesis constitutes a singular, non-obvious contribution over prior art, as opposed to a skillful but incremental integration of known techniques from these disparate domains?
-
Clarifying the Delta from Fetcher Units: Please elaborate on the fundamental differences with a state-of-the-art fetcher unit like Terminus [65]. Beyond being shared, how does the core architectural goal of reordering for bandwidth in DX100 lead to a design that is qualitatively different from a fetcher designed for latency tolerance and fine-grained dependency management? For instance, could Terminus be augmented with a large reorder buffer to achieve similar results, or is the entire architecture of DX100 fundamentally different?
-
Generality and Limits of the ISA: The paper claims a "general-purpose ISA" (Abstract, page 1). However, the ISA presented in Table 2 (page 4) appears highly specialized for loop-based streaming and indirect accesses. Could you provide examples of irregular access patterns that DX100 cannot efficiently support (e.g., true pointer-chasing in a linked list, as noted in Section 4.1, page 7)? A discussion of these limitations would help to more precisely define the boundaries of the proposed novelty.
SEAL: A Single-Event Architecture for In-Sensor Visual Localization
Abstract
Image sensors have low costs and broad applications, but the large data volume they generate can result in significant energy and latency overheads during data transfer, storage, and processing. This paper explores how shifting from traditional binary ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
This paper presents SEAL, a novel in-sensor computing architecture for visual localization frontends. The core proposal is to replace the conventional ADC-based sensor readout with an Analog-to-Time Converter (ATC) that feeds into a "race logic" temporal processor for denoising and edge detection. This is followed by a heavily quantized digital frontend for keypoint detection (GFTT) and tracking (LK). The authors claim significant reductions in latency (16-61x) and energy (7x) compared to baseline systems, while maintaining comparable accuracy on standard VIO benchmarks like EuRoC.
While the approach of integrating temporal logic at the sensor level is interesting, this paper's central claims are predicated on several questionable methodological choices and optimistic interpretations of the results. The evaluation framework contains unfair comparisons, and the accuracy claims are not as robust as the authors suggest.
Strengths
- End-to-End System Evaluation: The authors are to be commended for evaluating their proposed frontend not as an isolated component but within two complete, well-known VIO frameworks (HybVIO and VINS-Mono). This provides a valuable system-level perspective.
- Exploration of Temporal Logic: The paper explores an unconventional computing paradigm (race logic) for a practical application, moving beyond simple proof-of-concept demonstrations to a full system design.
- Detailed Implementation Analysis: The paper provides detailed area and energy breakdowns for its components, synthesized in a modern process node, and includes an FPGA prototype for verification.
Weaknesses
My analysis reveals several significant weaknesses that undermine the paper's conclusions.
-
Unfair Baseline for Analog-to-Digital Conversion: The cornerstone of the claimed energy savings is the comparison in Table 2 (page 6) between a conventional SS-ADC and the proposed SEAL ATC. The baseline SS-ADC is assumed to have a 100 µs conversion time, while the SEAL-ATC has a 100 ns conversion time. A 100 µs conversion is exceedingly slow for modern high-speed image sensors. This three-orders-of-magnitude difference in operating speed appears to be an artificially chosen worst-case baseline that inflates the proposed system's benefits. A fair comparison would require benchmarking against a high-speed SS-ADC designed for a comparable frame rate.
-
Overstated Accuracy Claims and Cherry-Picking of Averages: The abstract claims SEAL "preserves robust tracking accuracy," citing an average RMS ATE decrease of 1.0 cm for HybVIO. This average masks significant performance degradation on several sequences. As seen in Table 10 (page 12), on the MH_01 sequence, HybVIO's error increases by 17% (from 24 cm to 28 cm). On V1_01 for VINS-Mono, the error increases by 50% (from 8 cm to 12 cm). An architecture that introduces such large, sequence-dependent errors cannot be described as preserving "robust" accuracy. The use of an average value to obscure these critical instances of failure is misleading. (The cited increases are checked arithmetically in the short sketch after this list.)
-
Conflation of Proposed vs. Hypothetical Designs: The paper presents a hardware design with a fixed edge threshold (N). However, in Section 5.3.2 and Table 11 (page 12), the authors show that a flexible edge threshold (which their hardware does not implement) is required to achieve the best accuracy, improving it by 16.4%. This is a critical flaw: the authors are using the superior results of a more complex, hypothetical design to justify the accuracy of their simpler, implemented one. The paper should evaluate the actual design that was implemented, not a "what if" scenario. The claim that implementing this is "beyond the scope of this work" is an insufficient defense for this methodological inconsistency.
-
Optimistic and Inequitable Hardware Comparisons: In Table 6 (page 11), the authors compare their synthesis-based area and energy estimates for SEAL against the published, post-layout, and measured results of ASICs like Navion and RoboVisio. It is well-established that pre-layout synthesis results are optimistic and do not account for routing overhead, clock-tree power, or physical design challenges. Furthermore, scaling results from different technology nodes (e.g., 65 nm for Navion) using a generic tool is an approximation at best. This is not an apples-to-apples comparison and casts doubt on the magnitude of the claimed hardware benefits.
-
Unaddressed Sensitivity to Analog Non-Idealities: The entire temporal processing pipeline relies on converting pixel voltage into a clean temporal delay via the ATC. The paper completely fails to address the impact of analog noise, comparator input offset voltage, or timing jitter on the race logic computations. These real-world effects would directly corrupt the "values" being processed, and their absence from the analysis suggests the simulation environment may be overly idealized.
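For reference, the sequence-level increases quoted in weakness 2 follow directly from the Table 10 values as cited there; a trivial check, assuming only those two value pairs (in meters):

```python
# Quick arithmetic check of the relative-error figures cited in weakness 2.
cases = {
    "HybVIO MH_01":    (0.24, 0.28),   # (baseline, SEAL) RMS ATE in meters
    "VINS-Mono V1_01": (0.08, 0.12),
}
for name, (base, seal) in cases.items():
    print(f"{name}: +{100 * (seal - base) / base:.0f}%")   # ~17% and 50%
```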
Questions to Address In Rebuttal
The authors must address the following points directly and precisely:
- Justify the choice of a 100 µs conversion time for the baseline SS-ADC in Table 2. Provide citations for modern, high-speed image sensors used in VIO/robotics that employ ADCs with this slow of a conversion time.
- How do you defend the claim of "preserving robust tracking accuracy" when your architecture results in a 50% increase in trajectory error on specific EuRoC sequences (e.g., V1_01 for VINS-Mono)?
- Please clarify why the main results of the paper are based on a fixed-threshold design, while a separate evaluation (Table 11) is used to show that a flexible-threshold design is superior. What are the estimated area, latency, and energy costs to implement the flexible-threshold capability in hardware?
- How can you claim a fair comparison in Table 6 when comparing your pre-layout synthesis estimates against the measured, post-layout results of fully realized ASICs from prior work? Please provide a more conservative, caveated analysis.
- What analysis has been performed to characterize the robustness of your ATC and race logic pipeline to analog noise and timing jitter, which are unavoidable in a physical implementation? What is the expected degradation in accuracy?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces SEAL, a novel architecture for in-sensor visual localization that proposes a fundamental shift in the early stages of the vision processing pipeline. The core contribution is the end-to-end co-design of a system that replaces traditional binary encoding with a delay-based, temporal encoding scheme inspired by race logic. This temporal paradigm begins at the point of sensing itself, with a custom Analog-to-Time Converter (ATC) replacing the standard ADC. The resulting delay-coded signals are then processed through a specialized "temporal processor" for denoising and edge extraction, followed by a heavily quantized digital frontend for keypoint detection (GFTT) and tracking (LK).
The authors claim this holistic approach dramatically reduces the data volume, latency, and energy typically associated with transferring and processing raw image data. By performing the vision frontend tasks entirely within the sensor, SEAL sends only the final keypoint tracks to the host processor. The evaluation, a commendable mix of analog simulation, digital synthesis, and full-system analysis, demonstrates a significant 7x reduction in sensor energy and a 16-61x reduction in frontend latency, all while maintaining tracking accuracy comparable to state-of-the-art software VIO frameworks like HybVIO and VINS-Mono.
Strengths
-
A Powerful and Cohesive Core Idea: The single most important contribution of this work is the principled, top-to-bottom application of a temporal computing paradigm to the visual localization problem. Instead of treating in-sensor computing as merely moving a conventional digital accelerator closer to the pixels, the authors have re-imagined the data representation itself starting from the analog domain. The co-design of the ATC with the downstream race logic circuits (Section 3.1.1, page 5) is a particularly elegant example of this systems-level thinking. This approach creates a virtuous cycle: the temporal encoding enables efficient, massively parallel processing, which in turn justifies the custom conversion hardware.
-
Bridging a Key Gap in the Literature: SEAL occupies a fascinating and underexplored middle ground between two dominant trends in advanced vision sensors. On one side are purely analog in-sensor processors (e.g., RedEye [35]), which offer high efficiency but face challenges with scalability and programmability. On the other side are event-based cameras (e.g., Dynamic Vision Sensors), which are temporally efficient but capture information about change rather than absolute intensity, often requiring entirely new algorithms. SEAL cleverly combines the strengths of both worlds: it leverages a temporal, event-like signal (a single 0->1 transition per pixel per frame) to gain efficiency, but that signal encodes absolute intensity, making it fully compatible with the vast body of established, frame-based vision algorithms like GFTT and LK. This is a significant contribution to the field of computational sensing.
-
Exceptional System-Level Co-Optimization: The paper is a masterclass in holistic design. The benefits cascade through the system:
- Replacing the ADC with a simpler ATC and removing the TDC frees up area and power (Section 3.1, page 5).
- Race logic's "single-wire-per-variable" property enables massively parallel median filtering and edge extraction with minimal hardware (Sections 3.2 and 3.3, page 6).
- The binarized edge map produced by the temporal processor naturally enables aggressive quantization in the digital GFTT and LK frontend, leading to tiny, efficient hardware (Section 4, pages 7-9). This chain of co-optimizations is what makes the final system so compelling and efficient.
-
Thorough and Convincing Evaluation: The authors have gone to great lengths to validate their claims across multiple levels of abstraction. The combination of Cadence simulations for the analog components, Synopsys synthesis for the digital logic, and full-system VIO framework evaluation on multiple host CPUs provides a robust and credible assessment of the system's performance. The direct, scaled comparisons to strong prior work like Navion [69] and RoboVisio [81] (Table 6, page 11) clearly situate the work and highlight its substantial advantages in latency and energy.
Weaknesses
-
Limited Contextualization Against Event-Based Vision: While the paper successfully differentiates itself from traditional digital accelerators, it misses an important opportunity to discuss its relationship to the field of event-based vision. Dynamic Vision Sensors (DVS) are a major alternative for low-latency, low-power vision. A discussion of the trade-offs would significantly strengthen the paper's positioning. For instance, SEAL provides dense information every frame (at a fixed rate), whereas DVS provides sparse, asynchronous data. This makes SEAL better suited for classic algorithms but potentially less efficient in static scenes. This contextual link is a missing piece in an otherwise comprehensive paper.
-
The Practicality of Static Thresholding: The authors rightly identify that a flexible, adaptive edge threshold can significantly improve accuracy (Table 11, page 12). However, the current hardware implementation relies on a fixed threshold N. While presented as future work, this is a non-trivial limitation. Real-world scenarios involve dramatic changes in lighting, which would necessitate dynamic thresholding for robust performance. The paper would be more complete if it discussed the potential hardware pathways and overheads to implement such adaptivity (e.g., by modulating the ramp generator's slope, as hinted at in Section 3.3, page 7).
-
Uncertain Scalability to Richer Vision Tasks: The architecture is brilliantly optimized for corner detection and tracking, which rely on spatial gradients. However, the aggressive binarization of the image into edges discards a vast amount of texture and intensity information. It is unclear how this paradigm would extend to more complex vision tasks like object recognition, semantic segmentation, or even descriptor-based feature matching (e.g., ORB, SIFT), which rely on this richer information. While this is outside the paper's direct scope, acknowledging this boundary and discussing the potential of the underlying temporal representation for these tasks would provide a more complete picture of the paradigm's potential and limitations.
Questions to Address In Rebuttal
-
Could the authors elaborate on the conceptual trade-offs between their frame-based temporal encoding approach and the asynchronous, change-based approach of Dynamic Vision Sensors (DVS)? In what scenarios would SEAL be fundamentally more advantageous, and vice-versa?
-
The analysis of a flexible edge threshold in simulation shows clear accuracy benefits (Table 11, page 12). Could you briefly discuss the potential hardware complexity or circuit-level modifications required to implement this adaptivity in the SEAL architecture? Would this negate a significant portion of the area or energy savings?
-
The core of SEAL's frontend is an efficient edge extractor. Could you speculate on how the temporal processing paradigm might be adapted to preserve more pixel-level information (beyond a binary edge map) to support more complex, texture-dependent computer vision tasks in the future? For instance, could multiple thresholds be used in the temporal domain to produce a quantized intensity map?
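As a purely illustrative take on the last question, the sketch below assumes pixel intensity maps linearly to the comparator trip time (all constants are arbitrary and hypothetical) and shows how comparing that single edge against K reference delays yields a thermometer-coded, coarsely quantized intensity level instead of a single binary edge bit.

```python
# Toy model of multi-threshold temporal quantization; not SEAL's circuitry.
T_FRAME = 100.0                        # conversion window, arbitrary units
K = 4                                  # number of temporal thresholds

def trip_time(intensity):              # assumed linear intensity -> arrival time
    return intensity * T_FRAME

refs = [T_FRAME * (k + 1) / (K + 1) for k in range(K)]   # reference delays

def quantize(intensity):
    t = trip_time(intensity)
    return sum(t > r for r in refs)    # thermometer code -> level 0..K

print([quantize(v) for v in (0.05, 0.3, 0.55, 0.8, 0.99)])   # [0, 1, 2, 3, 4]
```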
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The central thesis of this paper is the introduction of a novel in-sensor computing architecture, SEAL, which leverages a co-design of analog-to-time converters (ATCs) and race logic to create a "temporal processor." This processor performs denoising and edge extraction directly on delay-coded signals from the pixel array, bypassing the need for a full Analog-to-Digital Converter (ADC) and large SRAM buffers for raw pixel data. The resulting binarized edge map is then consumed by a heavily quantized, but algorithmically conventional, digital frontend processor that executes GFTT keypoint detection and LK tracking. The authors claim this single-event architecture provides substantial reductions in latency and energy for the visual localization frontend while maintaining accuracy.
Strengths
The primary strength of this work lies in its architectural novelty. The core idea is genuinely new and represents a significant departure from existing approaches to in-sensor computing.
-
Novel Computational Paradigm at the Sensor Interface: The concept of halting the conventional analog-to-digital conversion process after the ATC stage and directly feeding delay-coded signals into a temporal processor built with race logic is, to my knowledge, a new contribution to the field of in-sensor computing. Standard architectures either use full ADCs to bring data into the digital domain for processing (e.g., [1], [34]) or perform computation in the analog domain (e.g., RedEye [35], LeCA [45]). SEAL carves out a new, distinct space between these two extremes.
-
Synthesis of Existing but Disparate Concepts: While the authors correctly cite prior art for race logic (e.g., Madhavan et al. [48], Tzimpragos et al. [72]) and acknowledge the components of a single-slope ADC, the synthesis of these concepts into an end-to-end pipeline for visual localization is original. Prior work on race logic has focused on accelerating dynamic programming, decision trees, or convolutions, but has not been integrated this tightly with the sensor's analog front-end for a computer vision application pipeline.
-
Insightful Co-design: The co-optimization between the analog ATC and the digital temporal processor (Section 3.1.1, page 5) is a particularly insightful element of the proposed novelty. By recognizing that a faster ramp/comparator in the ATC compresses the time domain, the authors demonstrate how this analog-level decision directly reduces the hardware cost (e.g., number of inverters in a delay chain) of the subsequent race logic processor. This demonstrates a thoughtful, cross-layer co-design that goes beyond simply connecting pre-existing blocks.
Weaknesses
While the core idea is novel, its presentation and the evaluation of its novelty could be sharpened.
-
Nuance in the "Fully Digital" Claim: The paper's claim of being a "fully digital" solution requires nuance. The architecture's foundation is the analog-to-time converter, which includes an analog ramp generator and comparator. The "digital" computation in the temporal processor operates on signals whose information is encoded in analog time delays. While the logic gates themselves are digital, the system is fundamentally a hybrid analog-temporal-digital one. This distinction is important, as the system is still susceptible to analog noise and process variation at its very core, a point not deeply explored.
-
Specificity of the Novelty: The novelty is tightly coupled to a specific class of algorithms (median filtering, Sobel-like edge detection) that map cleanly to the min/max/increment operations of race logic. It is unclear how this temporal processing paradigm would extend to more complex front-end tasks, such as learned feature extraction, which often rely on multiply-accumulate (MAC) operations. While recent work has explored temporal convolutions [16], the approach in SEAL seems specialized, potentially limiting the generality of this novel architecture. (A toy model of these min/max/increment primitives appears after this list to make the distinction concrete.)
-
Understated Design Complexity: The paper understates the design and verification challenges associated with temporal and race logic circuits. While implemented with standard cells, timing closure and analysis in such a paradigm are non-trivial compared to standard synchronous design. The novelty comes at the cost of adopting a less mature and tool-supported design methodology, a trade-off that should be more explicitly discussed.
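To ground weakness 2, here is a toy event-time model of the race-logic primitives in question (a sketch of the general paradigm, not SEAL's circuits): with values encoded as edge-arrival times, an OR gate realizes MIN, an AND gate realizes MAX, and a fixed delay realizes a constant increment, so a 3-input median needs only a handful of gates, whereas a signed multiply-accumulate has no comparably direct gate-level encoding.

```python
import random

# Values are arrival times of a rising edge; gates operate on those times.
def g_or(*t):    # first arrival = MIN
    return min(t)

def g_and(*t):   # last arrival = MAX
    return max(t)

def delay(t, d): # fixed delay = constant increment
    return t + d

def median3(a, b, c):
    # median(a, b, c) = max(min(a, b), min(max(a, b), c)), i.e. four gates
    return g_and(g_or(a, b), g_or(g_and(a, b), c))

random.seed(1)
for _ in range(1000):
    a, b, c = (random.randint(0, 255) for _ in range(3))
    assert median3(a, b, c) == sorted((a, b, c))[1]
print("median3 via min/max gates matches the sorted median")
```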
Questions to Address In Rebuttal
-
Generality of the Temporal Processor: Could the authors elaborate on the applicability of the temporal processor beyond median filtering and gradient-based edge extraction? How would one implement, for instance, a 3x3 convolution with arbitrary signed weights within this paradigm without losing its claimed efficiency benefits over a conventional post-ADC digital implementation?
-
Sensitivity to Analog Variations: Please clarify the boundary of the "fully digital" claim. Given the critical role of the analog ramp generator and comparator in defining the time base for the entire temporal processor, how sensitive is the overall system's accuracy to analog process variations, temperature drift, and power supply noise?
-
Exploration of Prior Art on ATC-based Computation: While the integration with race logic appears novel, has any prior work explored using the direct, non-digitized output of an ATC for any form of computation, even if not race logic? The core idea of "computing on the delay" generated by a ramp comparator may have appeared in other niche domains, and a more thorough search is warranted to precisely define the paper's delta.
-
Robustness of Binarized Frontend: The frontend processor's effectiveness relies on heavily binarized/ternarized data derived from a simple edge threshold (N). How does this approach fare in texture-rich or low-contrast scenes where simple edge information might be insufficient and a richer, grayscale representation is typically required for robust feature tracking? The novelty appears to force an aggressive, early-stage quantization whose failure modes are not fully explored.
IDEA-GP: Instruction-Driven Architecture with Efficient Online Workload Allocation for Geometric Perception
Abstract
The algorithmic complexity of robotic systems presents significant challenges to achieving generalized acceleration in robot applications. On the one hand, the diversity of operators and computational flows within similar task categories prevents the ...
Reviews
Review 1
Review Form:
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose IDEA-GP, an instruction-driven architecture for geometric perception tasks like SLAM and SfM. The core idea is to use an array of unified Processing Elements (PEs), designed around 3x3 matrix and 3x1 vector operations, which are common in pose-related calculations. A key feature is a compiler that performs what the authors term "online workload analysis" to partition these PEs between frontend (Jacobian/residual computation) and backend (sparse linear solve) tasks, aiming to balance the pipeline. The architecture is implemented on an FPGA and evaluated against CPU baselines, claiming significant speedups.
Strengths
- The paper correctly identifies a significant and relevant problem: the workload imbalance between frontend and backend stages in optimization-based SLAM/SfM, and the need for architectures that can adapt to different problem structures.
- The fundamental approach of decomposing complex geometric optimization problems into a set of primitive matrix-vector operations is a sound architectural principle.
- The evaluation is performed using standard, publicly available algorithms (VINS-Mono, OpenMVG) and datasets (EuRoC), which provides a degree of reproducibility.
Weaknesses
My primary concerns with this submission are the overstatement of generality, the simplistic nature of the core technical contribution (the "online" workload allocation), and an inadequate comparison to state-of-the-art baselines.
-
Insufficient Justification for GPU Baseline Dismissal: The authors dismiss GPUs in a single sentence in Section 9 (page 9), stating they offer "only limited acceleration" and citing a GitHub issue. This is wholly insufficient. Modern GPUs are highly effective at sparse matrix computations, and a rigorous comparison against a well-optimized GPU implementation is a critical missing piece of the evaluation. Without it, the claimed speedups over general-purpose CPUs are not properly contextualized within the field of high-performance computing. The provided evidence is anecdotal at best and does not constitute a valid technical argument.
-
Oversimplified "Online Workload Analysis": The core claim of "efficient online workload allocation" is predicated on the compiler model described in Section 8 (page 9). However, the model presented in equations (6) and (7) is a simple weighted sum based on the count of pre-defined residual types. The weights (
αiandβi) are derived from a "pre-built knowledge base." This is not "online analysis" of the problem structure; it is a parameter lookup. The model's effectiveness is entirely dependent on this offline characterization. The paper provides no information on how these parameters are derived, how they generalize to new sensor modalities or residual types, or how robust the model is. This significantly weakens the paper's central claim of a dynamic and intelligent allocation mechanism. -
Questionable Generality of the PE and ISA: The paper claims the PE design and architecture are "general" (Section 4, page 2). However, the entire dataflow and ISA are built around a specific solution strategy: Gauss-Newton with a Schur complement-based solve. It is unclear how this architecture would perform with other important classes of solvers, such as direct sparse Cholesky factorization on the full Hessian matrix, or non-linear solvers like Levenberg-Marquardt which require different steps. The primitive operations in Table 1 (page 5) are tailored to this specific workflow. Therefore, the claim of generality for "geometric perception" seems overstated; it is an architecture for a specific family of BA-style problems.
-
Contradictory Claims on Scalability and Bottlenecks: In Section 9.3 (page 11), the authors claim the architecture is scalable but immediately present evidence to the contrary. Figure 14 clearly shows that backend performance saturates quickly as more PEs are added (beyond ~12 PEs for the backend, in the VINS-Mono case). The authors themselves attribute this to "bandwidth constraints." This contradicts the claim of scalability and also casts doubt on the bandwidth efficiency argument made in Section 9.4. If the system is already bandwidth-bound with only 24 total PEs, its ability to scale to larger, more complex problems is severely limited. The analysis fails to differentiate between on-chip and off-chip bandwidth limitations, which is a critical detail.
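For clarity, the form implied by the paper's description (as summarized in weakness 2) appears to be a per-residual-type weighted sum feeding a proportional PE split. The sketch below reconstructs only that form; the residual-type names and the alpha/beta values are invented placeholders, not values from the paper, and serve to illustrate why this amounts to a parameter lookup rather than structural analysis.

```python
# Hypothetical reconstruction of a weighted-sum cost model; all constants invented.
ALPHA = {"imu_preint": 4.0, "visual_reproj": 1.5, "marg_prior": 0.8}  # frontend cost per residual
BETA  = {"imu_preint": 1.0, "visual_reproj": 2.5, "marg_prior": 0.3}  # backend cost per residual

def allocate_pes(residual_counts, total_pes=24):
    """Split PEs between frontend and backend in proportion to estimated cost."""
    t_front = sum(ALPHA[r] * n for r, n in residual_counts.items())
    t_back  = sum(BETA[r] * n for r, n in residual_counts.items())
    front = min(total_pes - 1, max(1, round(total_pes * t_front / (t_front + t_back))))
    return front, total_pes - front

print(allocate_pes({"imu_preint": 10, "visual_reproj": 400, "marg_prior": 1}))  # e.g. (9, 15)
```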
Questions to Address In Rebuttal
- Regarding the workload model (Section 8, page 9): Please provide a detailed explanation of how the cost coefficients (αi, βi) in equations (6) and (7) are derived. How robust is this model to new, uncharacterized residual types or different problem structures? What is the sensitivity of the overall performance to errors in these pre-computed coefficients?
- The GPU baseline was dismissed with a reference to a GitHub issue (Section 9, page 9). Please provide a quantitative comparison against a well-optimized GPU implementation (e.g., using cuSPARSE or similar libraries) on the same tasks to properly contextualize the reported speedups.
- How would the proposed IDEA-GP architecture handle optimization problems that are not easily solved via the Schur complement on the BA graph, such as those requiring a direct sparse Cholesky factorization of the Hessian? Is the ISA expressive enough to support control flows for alternative linear solvers?
- Section 9.3 (page 11) states that backend performance is "limited by bandwidth constraints," which appears to contradict the claim of scalability. Please clarify precisely which bandwidth is the bottleneck (DDR, on-chip buffer, etc.) and explain how this severe limitation impacts the practical scalability of the architecture for problems larger than those tested.
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents IDEA-GP, an instruction-driven architecture tailored for geometric perception tasks in robotics, such as SLAM and SfM. The core contribution is not merely another hardware accelerator, but a holistic architectural paradigm that seeks to resolve the long-standing tension between efficiency and generality. The authors achieve this through a clever co-design of three key components: 1) a general-purpose Processing Element (PE) optimized for the fundamental mathematical operations of 3D spatial transformations; 2) a two-level Instruction Set Architecture (ISA) that allows high-level commands to be decoded on-chip into fine-grained PE operations; and 3) an online compiler that analyzes the workload of a given task in real-time to dynamically allocate PE resources between the "Frontend" (problem construction) and "Backend" (problem solving) stages of the optimization pipeline. This allows the architecture to adapt on-the-fly to different algorithms (e.g., VINS-Mono vs. OpenMVG) and varying environmental conditions, maximizing hardware utilization and performance.
Strengths
The primary strength of this work lies in its elegant conceptual model and its successful execution. It carves out a compelling middle ground between rigid, fixed-function accelerators and inefficient general-purpose processors.
-
Addressing a Fundamental Problem: The paper correctly identifies and addresses a critical challenge in robotic computing: the dynamic and heterogeneous nature of workloads. Geometric perception is not a monolithic kernel; the computational balance between constructing the Jacobian matrix (Frontend) and solving the resulting sparse linear system (Backend) varies significantly across algorithms and even within a single trajectory. The online workload allocation mechanism is a direct and effective solution to this problem, as demonstrated by the analysis in Figures 12, 13 (Page 10) and the performance results in Figure 14 (Page 11).
-
Excellent Domain-Specific Co-Design: The authors have done a superb job of abstracting the domain of geometric perception down to its essential computational primitives: manipulations of 3x3 rotation matrices and 3x1 translation vectors (Table 1, Page 5). By designing the PE around these specific operations, they achieve high efficiency without sacrificing the generality needed to construct a wide variety of residual functions and optimization problems. This is a textbook example of how a domain-specific architecture should be designed. (A minimal sketch of this pose-composition primitive follows this list.)
-
A Step Towards "Robotic Processors": This work fits beautifully into the broader, emerging narrative of creating "robotic processors." While some research focuses on a single aspect like planning or control, IDEA-GP provides a robust and flexible solution for the perception subsystem. Its instruction-driven nature makes it feel less like a fixed accelerator and more like a true co-processor. It moves beyond the paradigm of offline hardware regeneration (seen in works like Archytas [26] and ORIANNA [17]) to a truly dynamic, online-reconfigurable system, which is a significant step forward for practical deployment.
-
Demonstrated Generality and Performance: The evaluation is convincing. By testing on both the VINS family of SLAM algorithms and the more computationally distinct OpenMVG for SfM, the authors substantiate their claims of generality. The reported speedups are substantial and validate the architectural approach.
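As a point of reference for strength 2, the primitive in question is ordinary SE(3) composition and point transformation, which decomposes into exactly the small matrix-matrix, matrix-vector, and vector-add operations the PE targets. A minimal numpy sketch with illustrative values only:

```python
import numpy as np

# Composing two poses T1*T2 and applying the result to a point uses only
# 3x3 and 3x1 operations, the stated target of the proposed PE.
def compose(R1, t1, R2, t2):
    return R1 @ R2, R1 @ t2 + t1

def apply_pose(R, t, p):
    return R @ p + t

theta = np.pi / 4
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
R, t = compose(Rz, np.array([1.0, 0.0, 0.0]),
               np.eye(3), np.array([0.0, 2.0, 0.0]))
print(apply_pose(R, t, np.array([0.0, 0.0, 1.0])))
```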
Weaknesses
The paper is strong, but its potential could be further highlighted by addressing a few points where the context is incomplete.
-
The Compiler's Oracle: The system's intelligence hinges on the compiler's ability to accurately predict workload. The paper mentions a "pre-built knowledge base" (Section 8, Page 9) for estimating computational costs. This component is crucial yet treated somewhat as a black box. The long-term viability and adaptability of IDEA-GP depend on how this knowledge base is created, maintained, and extended to novel algorithms or residual types not seen during its design. This is the main potential point of brittleness in an otherwise flexible system.
-
Positioning vs. Other Flexible Architectures: The paper positions itself well against fixed-function accelerators. However, it would benefit from a more explicit comparison to other flexible paradigms, such as Coarse-Grained Reconfigurable Arrays (CGRAs). A discussion explaining why IDEA-GP's domain-specific PE and ISA are more suitable for this problem domain than a generic spatial computing fabric would strengthen the authors' claims and better situate their contribution within the broader computer architecture landscape.
-
Scalability Bottlenecks: The authors rightly claim their architecture is scalable and acknowledge a bandwidth limitation in their 24-PE implementation (Section 9.3, Page 11). This is a critical point that deserves more exploration. As robotic perception moves towards higher resolution sensors and denser maps, the ability to scale to hundreds of PEs will become vital. A more in-depth analysis of the memory hierarchy and on-chip network would provide valuable insights into the true scalability limits and potential future improvements.
Questions to Address In Rebuttal
-
Could the authors elaborate on the construction and extensibility of the compiler's "pre-built knowledge base"? How would the system handle a completely new type of residual function introduced by a future algorithm? Is there an automated profiling process, or does it require manual characterization?
-
How would the authors contrast the efficiency of IDEA-GP with a hypothetical implementation on a state-of-the-art generic spatial accelerator (e.g., a CGRA)? What are the key advantages in terms of performance, area, and power that stem from the domain specialization of the PEs and ISA?
-
Regarding the bandwidth limitations noted for the Backend, what specific architectural modifications (e.g., deeper memory hierarchy, specialized data caches, different on-chip network topology) would be required to effectively scale the IDEA-GP architecture to a much larger number of PEs (e.g., 64 or 128)?
-
Looking forward, how do the authors envision IDEA-GP integrating into hybrid perception systems that are increasingly leveraging neural networks (e.g., for feature detection, data association, or as learned priors)? Could the PE array be adapted to support these workloads, or would it primarily function as a co-processor alongside a separate NN accelerator?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents IDEA-GP, a domain-specific architecture for accelerating geometric perception tasks, primarily SLAM and SfM. The core thesis is that these tasks suffer from two issues: (1) algorithmic diversity that makes fixed-function accelerators inefficient, and (2) dynamic workload variations between frontend (residual/Jacobian computation) and backend (sparse matrix solving) stages that lead to resource underutilization.
The authors' proposed novel contribution is a unified, instruction-driven architecture coupled with a compiler that performs online workload allocation. The architecture is built on an array of Processing Elements (PEs) designed for fundamental 3D pose mathematics. The compiler analyzes the incoming task, predicts the computational load of the frontend and backend, and generates instructions to dynamically partition the PE array between these two stages, thereby maintaining pipeline balance and maximizing throughput.
Strengths
The primary novel contribution of this work lies in its approach to runtime flexibility, which distinguishes it from prior art in hardware acceleration for robotics.
-
Shift from Design-Time Synthesis to Runtime Programming: The most significant point of novelty is the move away from hardware generation frameworks towards a runtime programmable architecture. Prior significant works like Archytas [26] and ORIANNA [17] address algorithmic diversity by re-synthesizing a new accelerator from a dataflow graph representation. IDEA-GP, in contrast, proposes a fixed (but scalable) hardware substrate that is programmed via an instruction stream. This fundamentally changes the flexibility model from a slow, offline process (re-synthesis) to a fast, online one (re-compilation of instructions). This is a meaningful and important delta.
-
Online Workload-Aware Resource Allocation: The explicit mechanism for online workload balancing is the key enabler of the aforementioned runtime flexibility. While the problem of frontend/backend imbalance is well-known, previous hardware solutions have typically committed to a fixed resource split. The concept of a compiler that models the computational cost (Eq. 6 & 7, page 9) and dynamically partitions hardware resources (Fig. 11, page 9) at runtime to solve this imbalance appears to be a novel contribution in this specific domain. The results in Figure 14 (page 11) directly validate the utility of this novel concept.
-
A Coherent Domain-Specific ISA: The authors have successfully abstracted the core computations of geometric perception into a set of primitive operations (Table 1, page 5) and a corresponding ISA (Table 2, page 7). While creating a domain-specific ISA is not a new idea in itself, its application here provides a clean interface between the software (compiler) and hardware (PEs) that is crucial for enabling the online allocation strategy.
Weaknesses
While the overall system concept has a strong novel element, the novelty of the constituent architectural components is less clear and could be better articulated.
-
Limited Novelty of the Core Architectural Primitives: The fundamental building blocks of the architecture are not, in themselves, novel. The PE design (Figure 4, page 5), a processing element optimized for 3x3 matrix and 3x1 vector operations, is a logical and well-established pattern for 3D graphics and robotics workloads. Similarly, the concept of a PE array executing in a dataflow or streaming manner is a foundational concept in spatial architectures. The paper's novelty rests almost entirely on the control and programming model applied to these standard components, not the components themselves.
-
Overlap with General-Purpose Spatial Architectures: The paper does not sufficiently differentiate its approach from general-purpose spatial computing or CGRA (Coarse-Grained Reconfigurable Array) frameworks. One could argue that the computational graphs of SLAM could be compiled onto a generic spatial accelerator (e.g., using a framework like DSAGEN [49]). The authors should more clearly articulate what specific, novel architectural features in IDEA-GP provide a critical advantage over such a general-purpose approach. Is it merely the domain-specific instruction set, or are there deeper microarchitectural optimizations that a general-purpose framework would miss?
-
Ambiguous Novelty of "On-Chip Instruction Generation": The paper highlights on-chip generation of basic instructions from high-level instructions (Section 6.2.2, page 7 and Figure 7, page 8). From an architectural standpoint, this strongly resembles a micro-coded control scheme, where a high-level instruction (e.g.,
gcal) is decoded into a sequence of micro-operations that steer the datapath. This is a classic technique in processor design. The paper needs to clarify what makes their specific implementation of this concept novel.
Questions to Address In Rebuttal
-
The primary distinction from prior art like ORIANNA [17] and Archytas [26] appears to be the move from a hardware synthesis framework to a runtime programmable architecture. Could the authors confirm this interpretation and elaborate on the specific hardware/software trade-offs (e.g., area overhead for instruction decoding vs. flexibility) that this new approach entails?
-
Beyond the domain-specific ISA, what are the key microarchitectural differences between IDEA-GP and a generic spatial accelerator or CGRA? If one were to compile the SLAM workload onto a state-of-the-art generic CGRA, where and why would it be less efficient than IDEA-GP?
-
Please clarify the novelty of the "on-chip instruction generation" mechanism. How does it differ conceptually from traditional microcode or a VLIW decoder that expands compact instructions into wider control words?
-
The Backend dataflow in Section 6.1 (page 6) breaks down the Schur complement computation into a five-stage process (pre, merge, geng, gcal, add). Is this decomposition itself a novel contribution, or is it an implementation of a known factorization algorithm mapped onto your architecture?
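For reference when weighing this question, the algebra that such a staged dataflow presumably maps onto is the standard Schur-complement reduction used in bundle-adjustment backends. The sketch below shows only that reference algebra (dense and unoptimized, inverting the landmark block directly rather than exploiting its block-diagonal structure), not the paper's five-stage mapping.

```python
import numpy as np

# Standard Schur-complement solve of a Gauss-Newton normal system H x = b,
# partitioned into camera (c) and landmark (l) blocks; sizes are arbitrary.
rng = np.random.default_rng(0)
nc, nl = 6, 20
J = rng.standard_normal((60, nc + nl))
H = J.T @ J + 1e-3 * np.eye(nc + nl)      # SPD normal matrix
b = rng.standard_normal(nc + nl)

Hcc, Hcl = H[:nc, :nc], H[:nc, nc:]
Hlc, Hll = H[nc:, :nc], H[nc:, nc:]
bc, bl = b[:nc], b[nc:]

Hll_inv = np.linalg.inv(Hll)
S  = Hcc - Hcl @ Hll_inv @ Hlc            # reduced camera system
g  = bc  - Hcl @ Hll_inv @ bl
xc = np.linalg.solve(S, g)                # camera update
xl = Hll_inv @ (bl - Hlc @ xc)            # back-substituted landmark update

assert np.allclose(np.concatenate([xc, xl]), np.linalg.solve(H, b))
print("Schur-complement solve matches the full solve")
```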
Avalanche: Optimizing Cache Utilization via Matrix Reordering for Sparse Matrix Multiplication Accelerator
Abstract
Sparse Matrix Multiplication (SpMM) is essential in various scientific and engineering applications but poses significant challenges due to irregular memory access patterns. Many hardware accelerators have been proposed to accelerate SpMM. However, they ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present Avalanche, a hardware accelerator for sparse matrix-matrix multiplication (SpMM) that uses an outer-product (OutP) dataflow. The central claim is that existing OutP accelerators that use tiling suffer from underutilized on-chip memory and excessive off-chip traffic for the B matrix. To address this, Avalanche introduces three techniques: 1) Mat-Reorder, which reorders the computation of matrix A's columns to improve temporal locality of intermediate products; 2) DP-Evict, which promptly evicts fully-computed ("dead") intermediate products; and 3) RM-Caching, which uses the memory freed by DP-Evict to cache reusable rows of matrix B. The authors claim their design achieves a 1.97x speedup over a state-of-the-art OutP accelerator.
Strengths
- The paper is well-structured and clearly identifies a relevant problem in tiled OutP SpMM accelerators: the tension between storing intermediate products (matrix C) and caching input matrices (matrix B) in on-chip memory.
- The motivation presented in Section 3 is backed by quantitative analysis (Figures 7, 9, 11), effectively demonstrating the baseline problems of B matrix traffic, memory underutilization, and the prevalence of dead products.
- The inclusion of an ablation study (SimpleCache, Avalanche-MD, Avalanche-MDR) is commendable, as it attempts to isolate the performance contributions of the proposed techniques.
Weaknesses
My primary concerns with this work relate to methodological rigor, unsubstantiated claims, and potentially over-simplified design choices that may not generalize well.
-
Unvalidated Performance Model: The entire evaluation hinges on a "custom cycle-accurate simulator in C++" (Section 6.1). There is no information provided on how this simulator was validated. Without validation against an RTL implementation or a hardware prototype, the reported performance numbers (speedup, execution cycles) are speculative at best. Cycle-accurate modeling of memory subsystems and contention is notoriously difficult to get right.
-
Unaccounted Pre-processing Overhead: The Mat-Reorder technique requires reordering the columns of matrix A to create a "lower triangular form" and generating the Reorder Operand Descriptor (ROD) structure. This is a non-trivial pre-processing step whose latency is completely ignored in the evaluation. For a fair comparison, this overhead must be quantified and included in the total execution time, especially if the matrix structure is not known ahead of time. The claim of "< 4% of the total memory traffic" for ROD updates (Section 5.2) is presented without any supporting data or analysis.
-
Arbitrary Limitation of RM-Caching: The Reuse Distance-Aware Matrix Caching (RM-Caching) technique is a cornerstone of the design, yet it is arbitrarily limited to only caching B matrix elements with a reuse distance of exactly one (i.e., reuse in the very next tile). The justification provided—"Considering area overhead and complexity" (Section 4.3)—is insufficient. This design choice severely limits the technique's effectiveness for matrices where reuse patterns are more complex or spread out, a fact implicitly conceded by the data in Figure 15, where several matrices show low reuse at distance one.
-
Oversimplified Conflict Resolution and Replacement Policies:
- The Job Allocator's mechanism for avoiding RAW hazards ("job conflict") is primitive (Section 5.3). Simply skipping a job destined for a busy row index can lead to load imbalance and potential starvation of certain jobs, degrading performance. The paper does not analyze the frequency or performance impact of this avoidance strategy, particularly on the row-dominant matrices where performance was admittedly poor.
- The Unified Cache replacement policy (Section 5.5) relies on a simple 3-level priority system. In cases of contention within a priority level, the victim is chosen "randomly" (Section 6). Random replacement is not a robust strategy and can lead to unpredictable performance. The impact of this randomness is not evaluated.
-
Flawed Presentation of Motivation: The motivation in Section 3.1.2 uses Figure 8 to show memory underutilization. However, this figure includes a plot for "Live Intermediate Product after Reordering," which uses one of the paper's key proposed techniques (Mat-Reorder). The motivation for a problem should be demonstrated using the baseline state-of-the-art, not a baseline that has already been partially improved by the proposed solution. This conflates the problem with the solution.
Questions to Address In Rebuttal
The authors must address the following points with concrete data and evidence:
-
Simulator Validation: How exactly was the cycle-accurate simulator validated? Please provide evidence of its accuracy, for example, by comparing its output against RTL simulation results for key microbenchmarks.
-
Pre-processing Costs: What is the computational complexity and actual wall-clock time of the Mat-Reorder algorithm that generates the ROD bucket? Please provide data showing this pre-processing overhead as a percentage of the total execution time for the reported workloads.
-
Justification for Reuse Distance: Provide a quantitative analysis to justify limiting RM-Caching to a reuse distance of one. What is the performance impact of extending this to reuse distances of 2, 3, or more, and what is the corresponding hardware complexity cost?
-
Job Allocator Performance: Provide statistics from your simulation on the frequency of "job conflicts." How many cycles are lost due to the allocator skipping jobs? Show how performance degrades as the degree of row-dominance (and thus potential for conflict) in a matrix increases.
-
ROD Traffic Overhead: The manuscript claims that ROD write traffic is "< 4% of total memory traffic." Please provide a figure or table with a breakdown of this traffic across all benchmark matrices to substantiate this claim.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents "Avalanche," a hardware accelerator for Sparse Matrix Multiplication (SpMM) that focuses on optimizing on-chip memory utilization. The authors identify a critical trade-off in state-of-the-art outer-product (OutP) dataflows: matrix tiling, used to manage the intermediate product matrix (C) on-chip, severely degrades the reuse of the input matrix B, leading to excessive off-chip memory traffic.
The core contribution is an elegant three-part co-design to resolve this tension: 1. Matrix Reordering (Mat-Reorder): A pre-processing step that reorders the columns of matrix A to approximate a lower-triangular form. This is not for improving spatial locality in the traditional sense, but ingeniously for making the completion of computations for rows of matrix C predictable and sequential. 2. Dead-Product Early Eviction (DP-Evict): A hardware mechanism, enabled by Mat-Reorder, that identifies and promptly evicts fully computed "dead" rows of C from the on-chip cache. 3. Reuse Distance-Aware Matrix Caching (RM-Caching): A cache management policy that utilizes the on-chip memory space freed by DP-Evict to store frequently reused elements of matrix B.
By systematically finishing parts of the output matrix early to make room for caching the input matrix, Avalanche significantly reduces off-chip traffic. The evaluation demonstrates a 1.97x average speedup over a state-of-the-art OutP accelerator (SPAGHETTI).
Strengths
-
Excellent Problem Formulation and Insight: The paper does an outstanding job of identifying and quantifying a fundamental bottleneck in tiled SpMM accelerators. The analysis in Section 3 (page 4), particularly Figures 7, 8, and 9, provides a compelling motivation by clearly showing the explosion in memory traffic for matrix B due to tiling, alongside the chronic underutilization of the on-chip memory allocated for matrix C. This sets the stage perfectly for their solution.
-
Novel and Synergistic Core Idea: The central concept of reordering the computation specifically to simplify runtime resource management is both novel and powerful. It represents a sophisticated co-design that links a compile-time/pre-processing strategy directly to a runtime hardware policy. The three proposed techniques are not independent; they are deeply synergistic. Mat-Reorder makes the "death" of a product predictable, which enables the simple hardware for DP-Evict. DP-Evict, in turn, creates the opportunity that RM-Caching exploits. This holistic design is a significant strength.
-
Broad Contextualization and Potential for Generalization: The work is well-positioned within the landscape of SpMM dataflows (OutP, RoW, InP). More importantly, the authors demonstrate the versatility of their core ideas by applying a modified version to a row-wise (RoW) dataflow accelerator, yielding a notable 11% performance improvement (Section 6.7, page 12). This suggests that the underlying principle—reordering to create predictable data lifetimes—is a generalizable technique that could have an impact beyond this specific architecture.
-
Strong and Thorough Evaluation: The experimental methodology is robust. The authors compare Avalanche against relevant and recent academic accelerators (SPAGHETTI, GAMMA) and modern GPUs. The inclusion of an ablation study through the SimpleCache and Avalanche-MD variants effectively decomposes the performance gains, confirming that the full synergistic system (Avalanche-MDR) is necessary for the best results.
Weaknesses
While the core ideas are strong, the paper could be improved by deepening its connection to adjacent fields and discussing certain practical overheads.
-
Under-explored Connection to Graph Reordering Algorithms: The concept of matrix reordering to improve locality is a classic topic in scientific computing and graph processing. Algorithms like Reverse Cuthill-McKee (RCM) [22] are designed to reduce matrix bandwidth, which has a similar effect of clustering non-zeros near the diagonal. The paper's Mat-Reorder technique, which transforms the matrix into a "lower triangular form" (Section 4.1, page 6), seems conceptually related. A discussion, or even a brief comparative analysis, of how the structure produced by Mat-Reorder relates to that produced by RCM or other graph partitioning algorithms would significantly enrich the paper. Is Mat-Reorder reinventing a known technique for a new purpose, or is it fundamentally different? Clarifying this would better situate the work within the broader literature.
-
Overhead of the Reordering Pre-processing Step: The paper presents Mat-Reorder as a key enabler but does not analyze its pre-processing cost. The complexity and runtime of this reordering step are crucial for understanding the overall application performance. For applications involving a single SpMM, a costly reordering step could negate the benefits; for iterative methods (e.g., iterative solvers, graph algorithms) where the same matrix is used repeatedly, the cost can be amortized. The paper would be strengthened by a discussion of this overhead and the application contexts in which it is most practical. (A toy reordering is sketched after this list to illustrate the kind of cost and structure in question.)
-
Implications for Scalability and Parallelism: The reordering enforces a specific processing sequence of matrix A's columns to ensure predictable completion of C's rows. This might introduce dependencies that could, in theory, limit parallelism at a larger, system-wide scale (e.g., in a distributed memory setting or across multiple accelerators). The current evaluation is confined to a single accelerator. A brief discussion on the potential implications of this reordering on higher-level parallelism would be valuable.
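To make weaknesses 1 and 2 concrete, the toy below uses a naive stand-in for Mat-Reorder (sorting A's columns by their minimum nonzero row index; this is not claimed to be the paper's algorithm) in an outer-product view of SpMM. It illustrates the kind of per-row completion ("death") positions such a reordering produces and why they become bounded and predictable, while the reordering itself costs roughly an O(nnz + n log n) sort.

```python
import random

# Outer-product view of C = A @ B: processing column k of A touches every row i
# with A[i, k] != 0; row i of C is "dead" once its last such column is processed.
# Under the naive ordering below, any column that can touch row r sits among the
# first count_r positions, so row r's death position is bounded by count_r - 1.

def death_positions(col_rows, order):
    """Position in `order` of the last column touching each row."""
    death = {}
    for pos, col in enumerate(order):
        for row in col_rows[col]:
            death[row] = pos
    return death

random.seed(0)
n_rows, n_cols = 8, 8
col_rows = {c: sorted(random.sample(range(n_rows), random.randint(1, 4)))
            for c in range(n_cols)}

natural = list(range(n_cols))
reordered = sorted(natural, key=lambda c: min(col_rows[c]))   # O(nnz + n log n)

d_nat = death_positions(col_rows, natural)
d_reo = death_positions(col_rows, reordered)
for r in sorted(d_nat):
    bound = sum(min(col_rows[c]) <= r for c in range(n_cols)) - 1
    assert d_reo[r] <= bound
    print(f"row {r}: natural death @ {d_nat[r]}, reordered death @ {d_reo[r]} (bound {bound})")
```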
Questions to Address In Rebuttal
-
Could you please quantify the pre-processing overhead of the Mat-Reorder technique, both in terms of computational complexity and measured time on representative matrices? In what application scenarios (e.g., single-shot SpMM vs. iterative methods) do you see this overhead being acceptable?
-
Can you elaborate on the relationship between your Mat-Reorder algorithm and established sparse matrix reordering algorithms like Reverse Cuthill-McKee (RCM)? Does your technique produce a similar matrix structure, and is the optimization goal (enabling predictable product completion) fundamentally different from RCM's goal of bandwidth reduction?
-
The Mat-Reorder technique imposes a new computational order. Have you considered the potential impact this might have on parallelism at scales larger than a single accelerator (e.g., in a multi-chip system)? Does this reordering create dependencies that might complicate partitioning the workload across multiple nodes?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present "Avalanche," a hardware accelerator for sparse matrix multiplication (SpMM) that uses an outer-product (OutP) dataflow. The central claim is a novel approach to optimizing on-chip memory utilization by coordinating three techniques: 1) Matrix Reordering (Mat-Reorder) to control the computation order, 2) Dead-Product Early Eviction (DP-Evict) to free up cache space from completed intermediate products, and 3) Reuse Distance-Aware Matrix Caching (RM-Caching) to use that freed space for caching elements of the input matrix B.
While the performance improvements are significant, the novelty of the core constituent ideas is limited. The contribution lies not in the invention of new primitives, but in the specific synthesis of pre-existing concepts to solve a known problem in tiled OutP accelerators—namely, the poor reuse of matrix B. The work is best characterized as a novel engineering integration of known techniques, rather than a fundamental breakthrough in SpMM acceleration.
Strengths
-
Novel Synthesis: The core conceptual link—using a specific computation reordering to create predictable "dead" products, which are then evicted to make room for caching reusable inputs—is a clever and coherent architectural strategy. The novelty is in this specific causal chain of optimizations.
-
Clear Problem Definition: The paper does an excellent job identifying a key weakness in the prior state-of-the-art OutP accelerator, SPAGHETTI [17]: matrix tiling solves the memory bloating of matrix C but introduces significant re-fetch traffic for matrix B (as shown in Figure 7, page 4). Avalanche presents a direct and effective solution to this specific, well-defined problem.
Weaknesses
The primary weakness of this paper from a novelty standpoint is the heavy reliance on established concepts and, in one case, a recently published hardware mechanism. The framing of the work as containing three distinct, novel techniques is an overstatement.
-
Limited Novelty of Mat-Reorder: The concept of reordering matrix computations to improve locality is a cornerstone of high-performance computing. More specifically, the hardware mechanism used to implement this reordering, the Reorder Operand Descriptor (ROD), is explicitly adopted from HARP [18], as stated in Section 4.1 (page 6) and Figure 13. HARP used this mechanism for "pseudo-tiling" to eliminate inefficient computations. Avalanche retargets this exact mechanism to improve the temporal locality of intermediate products. While the goal is different, the underlying mechanism is not new. The paper should more clearly position this as an application of the HARP mechanism rather than a novel reordering technique in its own right.
-
DP-Evict is a Consequence, Not an Invention: The concept of evicting data that is no longer needed (i.e., "dead") is fundamental to memory management. The challenge in SpMM has always been identifying when a product becomes dead due to irregular computation patterns. The authors' Mat-Reorder technique is what makes this identification tractable and predictable. Therefore, DP-Evict is not a novel technique in itself but rather a straightforward consequence enabled by the (borrowed) reordering mechanism. The novelty lies in making the problem simple enough to solve, not in the solution itself.
-
RM-Caching is a Simplification of Known Theory: Reuse distance is a well-understood concept in memory hierarchy analysis. The implementation here, which statically prioritizes caching matrix B elements with a reuse distance of exactly one (i.e., for the next adjacent tile), is a pragmatic but highly simplified application of this theory. It is not a novel caching policy but rather a targeted heuristic enabled by the space freed up by DP-Evict.
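To ground the reuse-distance point above, the sketch below measures reuse distance at tile granularity in the standard way (the gap between consecutive tiles that touch the same B row). The access lists are made up; the point is only that RM-Caching as described captures the distance-1 bucket and nothing else.

```python
from collections import defaultdict

# Histogram of tile-granularity reuse distances for B rows in a tiled
# outer-product SpMM; tile_accesses[t] lists the B rows needed by tile t.
def reuse_distance_histogram(tile_accesses):
    last_seen = {}
    hist = defaultdict(int)
    for t, rows in enumerate(tile_accesses):
        for r in set(rows):
            if r in last_seen:
                hist[t - last_seen[r]] += 1
            last_seen[r] = t
    return dict(hist)

tiles = [[0, 1, 2], [1, 2, 3], [0, 3, 4], [1, 4, 5]]   # made-up example
print(reuse_distance_histogram(tiles))                 # {1: 4, 2: 2} here
```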
Questions to Address In Rebuttal
The authors should use the rebuttal to clarify the precise delta between their work and the prior art.
-
Regarding Mat-Reorder and HARP [18]: Please quantify the novelty of your Mat-Reorder technique beyond the contribution of HARP. Is the contribution solely the insight to re-target HARP's ROD mechanism to enforce a lower-triangular access pattern for improving temporal locality, or are there specific hardware or algorithmic modifications to the ROD mechanism itself?
-
Robustness of Reordering: The entire optimization cascade depends on the ability of Mat-Reorder to make intermediate products become dead in a predictable, sequential manner. How does the system's performance degrade for matrices that resist being reordered into a near-triangular form? What percentage of common sparse matrices exhibit this favorable property?
-
Complexity vs. Benefit Justification: The proposed architecture introduces non-trivial complexity: offline reordering to generate RODs, hardware to manage ROD reads/writes, a priority-based cache replacement policy, and a more complex job allocator to handle RAW hazards (Section 5.3, page 8). While the average speedup of 1.97x over SPAGHETTI is compelling, are there scenarios (e.g., matrices with low B-reuse, as identified in Figure 15, page 7) where this complexity provides only marginal benefits or even degrades performance?
-
Offline Pre-processing Cost: The Mat-Reorder technique requires an offline software step to analyze the matrix and generate the ROD buckets. What is the computational complexity and typical runtime of this pre-processing step? For applications where matrices are generated dynamically, this cost could be a significant practical limitation.
Debunking the CUDA Myth Towards GPU-based AI Systems: Evaluation of the Performance and Programmability of Intel's Gaudi NPU for AI Model Serving
Abstract
This paper presents a comprehensive evaluation of Intel Gaudi NPUs as an alternative to NVIDIA GPUs, which is currently the de facto standard in AI system design. First, we create microbenchmarks to compare Intel Gaudi-2 with NVIDIA A100, showing that ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
This paper presents a performance and programmability evaluation of the Intel Gaudi-2 NPU, positioning it as an alternative to the NVIDIA A100 GPU for AI model serving. The authors develop a suite of microbenchmarks to characterize primitive operations (compute, memory, communication) and evaluate two end-to-end workloads: recommendation models (DLRM) and large language models (LLM). The central thesis is that Gaudi-2 is a competitive alternative and that NVIDIA's dominance relies more on its software ecosystem than the CUDA programming model itself, a claim encapsulated by the provocative title, "Debunking the CUDA Myth."
While the paper provides a useful and detailed characterization of the Gaudi-2 architecture, its primary conclusions are undermined by significant methodological choices, internal inconsistencies, and unsupported claims. The work mistakes demonstrating Gaudi-2's competence in specific, favorable scenarios (e.g., large GEMMs) for a genuine challenge to the incumbent's deeply integrated hardware-software solution, especially in areas requiring fine-grained control and programmability.
Strengths
-
Comprehensive Microbenchmarking: The paper's most valuable contribution is the systematic, multi-faceted microbenchmark analysis in Section 3. The evaluation across GEMM, non-GEMM, memory access patterns, and collective communication provides a clear-eyed view of the Gaudi-2 processor's architectural characteristics.
-
Analysis of MME Reconfigurability: The reverse-engineering of the MME's configurable systolic array (Section 3.2, Figure 7) is insightful. The authors demonstrate convincingly how this feature improves compute utilization on irregularly-shaped GEMMs compared to a fixed-geometry systolic array. This is a solid piece of architectural analysis.
-
Highlighting System-Level Bottlenecks: The communication analysis (Section 3.4, Figure 10) correctly identifies the system-level interconnect (P2P RoCE vs. NVSwitch) as a key performance differentiator, appropriately separating the chip's capabilities from the server system's design.
Weaknesses
-
Misleading Title and Framing: The title "Debunking the CUDA Myth" establishes a premise the paper fails to support. The authors' own findings, particularly the struggles with programmability in the vLLM case study (Section 4.2), demonstrate the exact opposite: the tight coupling of the CUDA programming model with underlying hardware features like Tensor Cores (via WMMA APIs) constitutes a formidable advantage that the Gaudi software stack cannot replicate. The "myth" appears to be very much a reality.
-
Outdated and Convenient Comparison Point: The selection of the NVIDIA A100, a previous-generation accelerator, is a critical flaw. While justified on the basis of a shared process node (7nm), it is not the relevant competitor in the market at the time of publication. A comparison against the H100 would provide a far more honest assessment of Gaudi-2's competitiveness and would almost certainly show a much larger performance deficit, potentially invalidating the paper's core thesis. This choice appears engineered to make the Gaudi-2's performance seem more competitive than it is.
-
Major Contradiction in vLLM Performance Analysis: The vLLM case study (Section 4.2) contains a significant internal contradiction. The authors report that their optimized PagedAttention kernel on Gaudi-2 achieves only 45% of the A100's performance (Figure 17(c)). This is a catastrophic performance gap for the most critical component of transformer inference. Yet, they proceed to claim that the end-to-end performance is "similar" and "competitive" (Figure 17(d, e)), attributing this to Amdahl's Law and superior MLP performance. This is an extraordinary claim that lacks the necessary evidence. A detailed latency breakdown of the entire end-to-end inference process is required to substantiate how a >2x slowdown in the dominant kernel does not translate to a significant end-to-end slowdown. The current presentation is speculative and unconvincing. (An illustrative Amdahl-style calculation after this list shows the kind of breakdown that would be required.)
-
Downplaying Critical Hardware and Programmability Limitations:
- The paper identifies Gaudi-2's 256-byte minimum memory access granularity as a weakness but understates its impact. In the DLRM case study (Section 4.1), this limitation results in performance of only 47% of A100 for small embedding vectors. This is not a minor issue; it is a fundamental architectural mismatch for an entire class of important sparse workloads.
- The authors admit that the core MME compute units are not directly programmable via TPC-C, contrasting sharply with CUDA's direct access to Tensor Cores. They frame their reliance on the "black-box" Gaudi graph compiler as a high-level programming paradigm, but for performance engineers, this is a severe limitation that prevents the implementation of novel, state-of-the-art kernels (e.g., a from-scratch FlashAttention).
-
Unsupported Attribution of Performance Gaps: In Key Takeaway #6 (Page 12), the authors claim the DLRM performance gap for small vectors "primarily stems from A100's superior hardware architecture rather than from the differences in the programming models." This is an unsubstantiated assertion. They provide no experiment or analysis to decouple these two factors. An equally plausible explanation is that the CUDA/FBGEMM software stack is simply better at scheduling memory operations and hiding latency for this access pattern—a direct function of the programming model and its compiler.
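Returning to the vLLM contradiction flagged above: the accounting needed to resolve it is a one-line Amdahl-style identity, stated here with illustrative fractions that are this reviewer's assumptions, not numbers from the paper:

$$ \frac{T_{\text{Gaudi-2}}}{T_{\text{A100}}} \;=\; f \cdot s_{\text{attn}} \;+\; (1-f) \cdot s_{\text{rest}}, $$

where $f$ is the fraction of A100 end-to-end latency spent in attention, $s_{\text{attn}} \approx 1/0.45 \approx 2.2$ is the reported attention slowdown, and $s_{\text{rest}}$ is the relative cost of everything else. If, for example, the non-attention portion ran 1.3x faster on Gaudi-2 ($s_{\text{rest}} \approx 0.77$), end-to-end parity would require $f \lesssim 0.16$; for $f$ in the 25-30% range the identity instead predicts a 1.1-1.2x end-to-end slowdown. The rebuttal should report the measured $f$ and $s_{\text{rest}}$ that make the "similar performance" claim arithmetically consistent.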
Questions to Address In Rebuttal
-
Please justify the "Debunking the CUDA Myth" framing. Given that your vLLM case study highlights a critical lack of low-level programmability for the MME and a resultant 55% performance deficit in the core PagedAttention kernel, in what precise way has the "myth" of CUDA's performant ecosystem been debunked?
-
Provide a clear and compelling rationale for benchmarking against the A100 instead of its contemporary, the H100. How would your central conclusions about Gaudi-2's competitiveness change if compared against the current state-of-the-art?
-
To resolve the contradiction in Section 4.2, provide a detailed profiling breakdown (e.g., a flame graph or latency-per-layer table) of an end-to-end LLM inference request on both Gaudi-2 and A100. This data must explicitly show the time spent in the attention kernels versus the MLP layers to validate your claim that the >2x slower attention performance is masked by faster MLP execution.
-
Regarding the conclusion in Key Takeaway #6, what specific analysis allows you to definitively attribute the poor performance on small embedding vectors to hardware alone, excluding the role of the TPC-C programming model, its compiler, or the runtime's ability to manage fine-grained memory requests?
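As a point of reference for this question, the penalty implied by the 256-byte minimum access granularity can be bounded with simple arithmetic (the vector sizes below are illustrative assumptions, not the paper's configuration). For an embedding row of $d$ elements of $b$ bytes each, with $d \cdot b \le 256$, each gather moves 256 bytes of which only

$$ \frac{d \cdot b}{256} $$

is useful: a 64-element FP16 row (128 bytes) caps effective bandwidth at 50%, and a 16-element FP32 row (64 bytes) at 25%. Pairing this bound with achieved-bandwidth counters on both platforms would help separate the hardware granularity effect from compiler and runtime scheduling effects.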
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents a timely and comprehensive characterization of Intel's Gaudi-2 NPU, positioning it as a potential challenger to NVIDIA's long-standing dominance in the AI accelerator market. The authors draw a compelling parallel to the 2010 Intel paper that questioned the GPU's supposed 100x advantage over CPUs, framing this work as a spiritual successor that interrogates the "CUDA myth" in the modern AI era.
The work's core contribution is its multi-faceted evaluation methodology, which goes far beyond simple performance benchmarks. It synthesizes three critical dimensions:
- Micro-architectural Performance: A deep dive into the raw capabilities of Gaudi-2's compute (MME, TPC), memory, and communication subsystems in comparison to the NVIDIA A100.
- End-to-End Application Performance: An analysis of Gaudi-2's performance and energy efficiency on highly relevant, large-scale AI workloads—namely, Recommendation Systems (RecSys) and Large Language Models (LLMs).
- Programmability and Software Optimization: Two insightful case studies on porting and optimizing state-of-the-art algorithms (FBGEMM for RecSys, vLLM for LLMs) to Gaudi, exploring the capabilities and limitations of its software stack at both the low-level (TPC-C) and high-level (PyTorch).
The authors conclude that while Gaudi-2's hardware is competitive and even superior in some aspects (e.g., GEMM-heavy LLMs), its ultimate success hinges on bridging the software and ecosystem gap. They argue that NVIDIA's true "moat" is less about CUDA as a language and more about the rich, mature software ecosystem built upon it, and that with sufficient investment in high-level framework integration, competitors like Intel can become viable alternatives.
Strengths
The primary strength of this paper is its holistic and contextual approach. It successfully elevates a hardware characterization study into a broader commentary on the dynamics of the AI systems landscape.
-
Excellent Framing and Historical Context: The explicit connection to the 2010 "Debunking the GPU Myth" paper (mentioned in the Introduction, page 2) is brilliant. It provides an immediate and powerful narrative frame, positioning Intel's current role as the "underdog" and lending historical weight to their investigation of the hardware-software power balance.
-
Comprehensive, Multi-level Analysis: The paper avoids the common pitfall of focusing only on peak FLOPS or a single workload. By connecting microbenchmark results (Section 3.2-3.4) to end-to-end application behavior (Section 3.5), the authors provide a much more nuanced and credible picture. For example, they identify Gaudi-2's weakness in small, random memory accesses (Key takeaway #3, page 9) and then demonstrate its real-world impact on embedding-heavy RecSys models (Figure 11, page 10).
-
Novel Focus on Programmability: The case studies in Section 4 are the most significant contribution of this work. While performance numbers are valuable, understanding the effort and methodology required to achieve that performance on a new architecture is arguably more important for the community. The vLLM case study (Section 4.2, page 12), in particular, is a masterful demonstration of the challenges of working with a less mature, more abstracted software stack. It highlights the critical role of the graph compiler and the limitations imposed by the lack of direct MME control, a crucial insight for both systems programmers and future hardware designers.
-
Connecting to the Broader Academic Landscape: This work sits at the intersection of several key research areas: computer architecture (DSA vs. GPU), systems software (vLLM, FBGEMM), and compilers (the role of MLIR and graph compilation). By evaluating a real, commercially significant system, it provides a grounded case study that is relevant to researchers across these domains. It serves as an excellent reference point for anyone looking to understand the practical challenges of moving high-performance computing beyond the NVIDIA ecosystem.
Weaknesses
The weaknesses of the paper are minor and largely inherent to the nature of such a rapidly evolving field, rather than fundamental flaws in the methodology or conclusions.
-
A Snapshot in a Moving Stream: The analysis is based on a specific version of the Intel Gaudi Software stack (v1.18.0). Given the immaturity of the ecosystem compared to CUDA, it is highly likely that many of the observed software limitations (e.g., the "black box" nature of the graph compiler, suboptimal library kernels) will be addressed in future releases. The paper acknowledges this implicitly, but it could be more explicit about how its conclusions might be altered by a more mature software stack.
-
Limited Scope of Competitors: The comparison is exclusively between Gaudi-2 and the A100. While this is a logical and well-justified choice (as explained on page 2), the AI accelerator landscape is diversifying with strong offerings from AMD (e.g., MI300) and other cloud providers' internal silicon. Acknowledging this broader context more directly in the discussion would strengthen the paper's panoramic view. The "Future Work" section (page 13) does a good job of this, but a brief mention earlier would be helpful.
-
The "Myth" Remains Partially Intact: The paper's title makes a bold claim. While it successfully argues that Gaudi's hardware is not the bottleneck, its own findings (especially in Section 4) show that the inability to easily program that hardware at a low level (a key feature of CUDA) remains a significant hurdle. One could argue the "CUDA myth" is less about the language itself and more about the philosophy of direct, flexible hardware control it represents, which Gaudi's stack currently abstracts away. The paper's conclusion touches on this, but the argument could be sharpened.
Questions to Address In Rebuttal
-
The vLLM optimization case study (Section 4.2) is fascinating. It shows how performance was recovered by restructuring the problem at the PyTorch level to better suit the Gaudi graph compiler. This seems to place a significant burden on the application developer to understand the compiler's preferences. How does this compare to emerging programming models like OpenAI's Triton, which also uses a high-level language (Python) but aims to give programmers more explicit control over performance-critical transformations? Could you speculate on whether a Triton-like interface for Gaudi would be a more effective path forward than improving the "black box" heuristics of the current graph compiler? (A minimal Triton kernel after this list illustrates the kind of explicit control being contrasted here.)
-
Your analysis shows that the Gaudi-2 hardware is highly competitive, yet the final performance is often dictated by the software stack. Given that software is a rapidly moving target, which of your key takeaways do you believe are most fundamental to the Gaudi architectural philosophy and are likely to remain true for future generations (e.g., Gaudi-3 and beyond), and which are more likely to be rendered obsolete by near-term software improvements?
-
Regarding the title, your work convincingly demonstrates that high-level frameworks can abstract away the need for a CUDA-like language for many AI practitioners. However, it also shows that for developers of cutting-edge libraries (like vLLM), the lack of low-level control is a major impediment. Does this suggest that the "CUDA myth" is not debunked, but rather bifurcated? That is, it's a "myth" for application users, but a "reality" for systems programmers and library developers who need to extract every last drop of performance from the hardware.
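To anchor the first question above: the level of control a Triton-like interface exposes is visible even in a minimal kernel. The sketch below is a generic vector-add in the style of the Triton tutorials, unrelated to Gaudi or to any kernel in the paper; the programmer explicitly chooses the block size, the index arithmetic, and the masked loads and stores, while still writing Python.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                        # one program per block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                        # guard the ragged tail
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)      # explicit, tiled store

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), 1024),)
    add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
    return out
```

Whether this style of control could coexist with Gaudi's graph compiler, or is precluded by the MME's lack of direct programmability, is precisely the trade-off the question asks the authors to weigh.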
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents a comprehensive performance and programmability evaluation of the Intel Gaudi-2 NPU, positioning it as a competitor to the NVIDIA A100 GPU for AI model serving. The authors conduct this evaluation across three primary axes: (1) low-level microbenchmarks for compute, memory, and communication primitives; (2) end-to-end performance on representative AI workloads (RecSys and LLMs); and (3) two case studies on programmability, demonstrating low-level (TPC-C) and high-level (PyTorch) optimization strategies. The central thesis is that the Gaudi NPU architecture is competitive and that the dominance of NVIDIA's ecosystem is more attributable to its software maturity than an insurmountable hardware or programming model advantage.
My evaluation focuses exclusively on the novelty of the contributions. While the paper is well-executed, its primary contribution is one of characterization and insight, rather than the creation of a novel artifact or technique. The work's novelty rests on its claim to be the first academic study to perform this specific comparison with this level of depth, particularly regarding programmability.
Strengths
The primary strength of this paper, from a novelty perspective, lies in its synthesis and depth. While individual components of the evaluation methodology are not new, their application to the Gaudi-2 architecture in direct, rigorous comparison to the A100 provides novel insights that are not present in prior literature.
-
Novel Insights into Programmability (Section 4, page 11): The most novel aspect of this work is the detailed exploration of Gaudi's programmability. Prior works, such as Emani et al. [14] and Zhang et al. [83], focused on running LLMs and reporting performance. This paper goes a crucial step further by attempting to optimize key kernels from the ground up. The case studies on implementing a BatchedTable operator in TPC-C for DLRM (Section 4.1) and optimizing PagedAttention at the PyTorch level for vLLM (Section 4.2) provide a genuinely new perspective on the practical challenges and limitations of the Gaudi software stack, such as the black-box nature of the MME. This moves beyond performance measurement to an analysis of the development ecosystem itself.
-
Quantification of Architectural Trade-offs: The paper uncovers and quantifies specific architectural behaviors that were previously only qualitatively described in technical reports or were unknown. For example, the reverse-engineering and analysis of the MME's reconfigurability (Figure 7, page 6) is a novel contribution that explains Gaudi-2's unexpectedly high utilization on irregular GEMM shapes. Similarly, quantifying the sharp performance degradation for random memory accesses smaller than its 256-byte minimum granularity (Key takeaway #3, page 9) is a concrete, new data point for the community.
-
Explicit Differentiation from Prior Art: The authors clearly position their work against existing studies [14, 83] in Section 6 (page 14), correctly identifying the gaps they fill: a broader workload analysis (including RecSys), a focus on energy efficiency, and a deep dive into programmability. This demonstrates a clear understanding of the existing landscape and carves out a well-defined novel contribution.
Weaknesses
The weaknesses of this paper are intrinsically linked to its nature as a characterization study. The core ideas behind the methods used are not, in themselves, novel.
-
Methodological Non-Novelty: The use of microbenchmarks to probe hardware capabilities is a standard and well-established technique in computer architecture (e.g., [36, 50]). Similarly, evaluating end-to-end application performance is the standard for system papers. The novelty here is entirely in the subject of the study (Gaudi-2) rather than the method of study.
-
Overlap with "Folklore" and Technical Documentation: Some of the findings, while rigorously confirmed here for the first time in an academic venue, may exist as "folklore" or can be inferred from Intel's own documentation. For instance, the recommendation to unroll TPC loops to hide pipeline latency (Section 2.2, page 5) is a known best practice from Intel's developer guides [27]. The paper's contribution is to measure the precise impact of this practice (Figure 8b, page 8), which is valuable but represents a confirmation rather than a discovery.
-
Incremental Nature of Insights: While the paper presents several new insights, one could argue that they are incremental additions to the body of knowledge rather than a paradigm shift. The core finding—that a specialized accelerator is competitive with a GPU on some workloads but faces software maturity challenges—is a recurring theme in the history of domain-specific architectures. The value is in the specific details for Gaudi, but the high-level narrative is not new.
Questions to Address In Rebuttal
-
On the Significance of MME Insights: Your analysis of the MME's dynamic reconfigurability (Section 3.2, page 6) is presented as a key finding. However, given that the MME is not directly programmable by users and is managed by a proprietary graph compiler, what is the actionable, novel takeaway for the broader research community beyond "Intel's compiler is effective"? How does this insight inform the design of future open hardware or compiler stacks?
-
Novelty of Programmability Challenges: The vLLM case study (Section 4.2, page 12) highlights the difficulty of optimizing attention due to the lack of low-level MME control, forcing optimizations to the PyTorch level. Is this insight novel in the sense that it reveals a fundamental, permanent design choice in the Gaudi programming model, or is it simply a snapshot of the current SDK's immaturity? The novelty of this finding is significantly diminished if a future SDK release simply exposes the required low-level controls.
-
Delta Over Prior Art: Beyond being broader and including energy analysis, what is the single most significant architectural insight presented in this paper that fundamentally changes the understanding of Gaudi NPUs compared to the picture painted by prior work like Emani et al. [14] and Zhang et al. [83]? Please be specific.
-
Assessing the "CUDA Myth": The title invokes the 2010 ISCA paper "Debunking the 100X GPU vs. CPU Myth". That paper showed that with proper optimization, the performance gap between CPUs and GPUs was much smaller than claimed. Your paper finds that Gaudi-2's performance is competitive, but its programmability is a significant hurdle requiring substantial optimization effort (e.g., achieving only 45% of A100's performance on PagedAttention even after optimization). Does your work truly "debunk" the CUDA myth, or does it in fact reinforce it by demonstrating how critical a mature software ecosystem and direct hardware control (as provided by CUDA for Tensor Cores) are to achieving peak performance?
GPUs All Grown-Up: Fully Device-Driven SpMV Using GPU Work Graphs
Abstract
Sparse matrix-vector multiplication (SpMV) is a key operation across high-performance computing, graph analytics, and many more applications. In these applications, the matrix characteristics, notably non-zero elements per row, can vary widely and impact ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present a set of Sparse Matrix-Vector Multiplication (SpMV) implementations using a novel "Work Graphs" programming model on AMD GPUs. They claim this model, which enables fine-grained, on-device dynamic scheduling, significantly outperforms existing state-of-the-art library implementations from rocSPARSE. The primary claims are substantial performance speedups (up to 7.19x), drastically reduced memory overhead for auxiliary data structures, and simplified code complexity. While the reported performance numbers are impressive, the work's conclusions rest on a precarious methodological foundation due to the use of a non-standard, proprietary software and firmware stack, which introduces a critical confounding variable that undermines the central claims.
Strengths
-
Significant Reduction in Memory Overhead: The most compelling and defensible result is the drastic reduction in memory footprint for supporting data structures. Figure 7 clearly demonstrates that the Work Graphs approaches ("WG Fused," "WG Vanilla") have a small, fixed memory cost (~25-41 MiB), in stark contrast to
rocSPARSELRB ("RS Lrb"), whose memory requirement scales with the matrix dimensions to hundreds of megabytes. This is a clear and practical advantage. -
Detailed Performance Analysis: The evaluation in Section 6 is thorough. The use of hardware performance counters (Figure 5) to investigate memory system utilization provides valuable low-level insight that supports the performance narrative (e.g., improved L0/L1 cache utilization). Furthermore, the breakdown of multi-iteration performance for "regular" versus "irregular" matrices (Figure 6b) shows a commendable level of detailed analysis.
-
Demonstration of a Potentially Powerful Model: The paper successfully demonstrates that a dataflow-driven, on-device scheduling model can, in principle, address the known inefficiencies of static, host-driven kernel launches for irregular workloads like SpMV. The concept of fusing preprocessing with computation for short rows in the "WG Fused" variant is a logical application of this model.
Weaknesses
-
Critical Methodological Flaw: Confounding Software Stacks: The entire comparison is fundamentally unsound. The authors compare their proposed method, built on a research compiler (AnyDSL) and a private, vendor-provided Platform Abstraction Library (PAL) with custom firmware (Section 5), against a baseline built on the public ROCm/HSA stack. It is impossible to isolate the performance impact of the Work Graphs model from the performance impact of the underlying software stack. The observed speedups could partially or wholly originate from lower dispatch overheads in PAL for this specific use case, different code generation from the AnyDSL compiler, or other unstated optimizations in the private software stack.
-
Insufficient Rebuttal of the Confounding Variable: The authors attempt to address the software stack discrepancy in Appendix A by porting the baseline LRB algorithm to their stack ("AD Lrb"). They show "AD Lrb" is slightly slower than "RS Lrb" and conclude their stack is at a disadvantage, thereby strengthening their claims. This reasoning is specious. It only shows their stack is not well-optimized for a sequence of traditional kernel launches. It is entirely plausible that the PAL stack is, conversely, highly optimized for launching a single, large, complex work graph, which is precisely what their proposed method does. The comparison remains apples-to-oranges.
-
Lack of Reproducibility: The work relies on custom AMD firmware, drivers, and libraries that are not publicly available (Section 5.1). As such, the results are completely irreproducible by the wider research community. This is a severe weakness for a paper submitted to an academic venue, as it prevents verification and extension of the work by others.
-
Failure to Isolate Performance Contributions: The "WG Fused" implementation, which provides the best results, conflates two distinct optimizations: (1) the dynamic scheduling of long rows via Work Graphs, and (2) kernel fusion for short rows processed directly within the binning node. The evaluation does not contain an ablation study to disentangle these effects. It is unclear how much of the 7.19x speedup comes from fusion (which can sometimes be implemented with other techniques) versus the novel scheduling capability itself.
Questions to Address In Rebuttal
-
How can the authors definitively prove that the reported performance gains are a direct result of the Work Graphs scheduling paradigm, and not an artifact of the fundamentally different and proprietary AnyDSL/PAL software stack having lower overheads for this workload pattern compared to the public ROCm/HSA stack?
-
The Appendix A comparison shows the AnyDSL/PAL stack is slower for a sequence of kernel launches. Could the authors provide microbenchmark data comparing the dispatch latency of a single, complex work graph via PAL versus the cumulative dispatch latency of a series of small kernels via HSA? This is critical to substantiating the claim that the stack is not the source of the speedup.
-
To properly attribute the performance gains of the "WG Fused" model, can the authors provide an ablation study? Specifically, evaluate a variant that dispatches all rows to specialized downstream nodes (as in "WG Vanilla") while retaining the more efficient thread-allocation logic of "WG Fused," thereby isolating the benefit of kernel fusion from that of dynamic scheduling.
-
Given the complete reliance on a private and inaccessible software/firmware stack, what is the path forward for making this work reproducible? Can the authors commit to a timeline for a public release of the necessary components? Without a path to reproducibility, the work is more of a vendor whitepaper than a scientific contribution.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents a novel approach to sparse matrix-vector multiplication (SpMV) that leverages the emerging "GPU Work Graphs" programming model to achieve a fully device-driven execution. The core contribution is the demonstration that both the traditionally host-managed preprocessing (binning rows by length) and the subsequent parallel computation can be unified into a single, dynamic dataflow graph that executes entirely on the GPU. This eliminates host-CPU interaction and allows for fine-grained, on-the-fly scheduling of workgroups to specialized compute kernels, leading to interleaved execution that improves cache locality and hardware utilization.
The authors implement and evaluate several variants of this approach, culminating in a "fused" version that processes short rows directly within the binning node and only offloads longer, more complex rows to downstream nodes. Their results, benchmarked across 59 sparse matrices, show significant performance improvements of up to 7.19x (mean 3.35x) over the state-of-the-art rocSPARSE LRB implementation for single-iteration SpMV. Furthermore, their methods remain competitive or superior to the highly-optimized CSR-Adaptive algorithm for a realistic number of repeated multiplications. The work also highlights significant secondary benefits, including a drastically reduced and fixed memory footprint for metadata and lower implementation complexity.
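To make the execution model concrete, the fused filter-and-dispatch pattern described above can be summarized as a short control-flow sketch. The Python pseudocode below is this reviewer's illustration only: the node names, thresholds, and dispatch callback are assumptions, and the actual implementation runs as GPU work-graph nodes rather than host code.

```python
SHORT, MEDIUM = 16, 256          # illustrative row-length thresholds

def binning_node(csr_row_ptr, csr_cols, csr_vals, x, y, dispatch):
    """'Fused' entry node: classify each row by length, compute short rows
    in place, and emit records for longer rows to specialized nodes."""
    for row in range(len(csr_row_ptr) - 1):
        lo, hi = csr_row_ptr[row], csr_row_ptr[row + 1]
        length = hi - lo
        if length <= SHORT:
            # Short rows: no record traffic, computed inside the binning node.
            y[row] = sum(csr_vals[k] * x[csr_cols[k]] for k in range(lo, hi))
        elif length <= MEDIUM:
            dispatch("vector_node", record=(row, lo, hi))    # one wave per row
        else:
            dispatch("long_row_node", record=(row, lo, hi))  # multi-wave reduce
```

A "WG Vanilla"-style variant would emit a record for every row; the delta between the two is precisely the record traffic and scheduling latency that fusion eliminates.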
Strengths
The primary strength of this paper is its outstanding contextual significance. It serves as a compelling and timely case study for a paradigm shift in GPU computing—from a host-orchestrated, bulk-synchronous model to a more autonomous, dataflow-driven one.
-
Demonstration of a New Execution Model: SpMV is a perfect "poster child" for the limitations of traditional GPU programming. By successfully applying GPU Work Graphs to this notoriously irregular problem, the authors provide the community with a clear and powerful demonstration of what this new hardware capability enables. This work bridges the gap between an abstract hardware feature and a concrete, high-impact application, effectively showing how "GPUs are growing up."
-
Excellent Problem Framing and Context: The background section (Section 2, page 2) effectively situates Work Graphs within the history of GPU scheduling, drawing clear distinctions between this new model and prior art like CUDA/HIP Graphs (coarse-grained, static) and Dynamic Parallelism (kernel-level granularity). This contextualization is crucial for understanding why Work Graphs represent a fundamental step forward.
-
Holistic and Convincing Evaluation: The empirical evaluation is thorough and persuasive.
- The performance gains are not merely incremental; they are substantial enough to command attention (Figure 4, page 10).
- The analysis thoughtfully covers both single-iteration (Section 6.1) and multi-iteration (Section 6.2) scenarios, acknowledging the amortization trade-offs that dominate real-world use cases.
- Crucially, the authors look beyond raw speed. The analysis of memory overhead (Figure 7, page 12) and implementation complexity (Table 3, page 12) paints a complete picture. The finding that the Work Graphs approach uses a fixed ~25 MiB of memory, versus a data-dependent footprint that scales to hundreds of megabytes for
rocSPARSE LRB, is a powerful secondary result with practical implications for memory-constrained systems.
-
Enabling Future Research: This paper lays the groundwork for rethinking many other irregular algorithms on GPUs. The core concept—on-device classification and dynamic dispatch to specialized workers—is broadly applicable to graph analytics (e.g., processing nodes based on degree), particle methods (e.g., sorting interactions by distance), and adaptive mesh refinement. This work provides both the inspiration and a methodological template for future explorations in these areas.
Weaknesses
The weaknesses of this paper are largely related to the nascent state of the underlying technology rather than fundamental flaws in the research itself. However, they are important to acknowledge.
-
Limited Generality and Reproducibility: The work relies on a non-standard, vendor-specific toolchain (AnyDSL combined with a custom build of AMD's PAL) to access the Work Graphs feature. This is a significant barrier to reproducibility and adoption for the broader HPC community, which primarily operates within standardized frameworks like ROCm/HIP and CUDA. While the authors are transparent about this, it currently situates their work as a "proof of concept" rather than a readily deployable solution.
-
Single Architecture Study: The evaluation is performed on a single AMD RDNA3-based GPU. The remarkable performance gains are likely tied to the specific design of its Command Processor and its ability to manage the dataflow of records between nodes efficiently. It is unclear how these concepts and performance characteristics would translate to other architectures, such as NVIDIA's, which may have different hardware support for fine-grained, on-device scheduling (e.g., via Persistent Thread Blocks or other future mechanisms).
-
Potential Understatement of Broader Impact: While the paper is strong, its focus remains tightly on SpMV. It could benefit from a more explicit discussion of how the demonstrated principles could be generalized into a programming pattern or library abstraction for other irregular, data-driven problems. The paper proves that it works for SpMV but could do more to articulate how the pattern can be used elsewhere.
Questions to Address In Rebuttal
-
Path to Mainstream Adoption: Given the reliance on a custom toolchain, what do the authors see as the most viable path for integrating these device-driven techniques into mainstream HPC ecosystems? For instance, could the
rocSPARSE library itself be evolved to use Work Graphs internally, abstracting the complexity from the user? Or would this require fundamental extensions to programming models like HIP, SYCL, or OpenCL?
Architectural Dependencies: Can the authors speculate on the key hardware features of the AMD Command Processor that are most critical to their success? For example, is it the size of the on-chip buffers for records, the latency of the scheduling logic, or the efficiency of the semaphore mechanism described in Section 2.5 (page 4)? Understanding this could help predict the viability of this approach on future or different hardware.
-
Generalizing the "Fused" Approach: The "fused" kernel (Section 4.2.3, page 7) is a clever optimization that processes simple work (short rows) immediately and only dispatches complex work. This pattern of "filter and dispatch" seems highly generalizable. Could the authors comment on its application to other domains? For example, in graph algorithms, could a "fused" node process low-degree vertices locally while dispatching high-degree vertices to parallel reduction kernels?
-
Comparison to a "Perfectly Tuned" General Kernel: The performance of
RS General deteriorates on matrices with long rows due to thread divergence. Could an RS General-style kernel that is heavily optimized for a specific, narrow range of row lengths (e.g., 1-16) achieve similar performance to the "in-place" processing in the WG Fused binning node? This would help clarify whether the performance gain comes from the fusion itself (eliminating record traffic) or simply from better specialization, which could theoretically be achieved in other ways.
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents a novel implementation of sparse matrix-vector multiplication (SpMV) on GPUs by leveraging the emerging "GPU Work Graphs" programming model. The authors' central claim is that this new model, which enables fully device-driven, fine-grained, dataflow-based scheduling, can overcome the limitations of traditional host-driven, coarse-grained SpMV algorithms. They propose several variants, including a direct translation of the state-of-the-art Logarithmic Radix Binning (LRB) approach and a more advanced "fused" version that combines preprocessing and computation for short rows within a single graph node. The work demonstrates that this new approach significantly outperforms established rocSPARSE implementations in single- and multi-iteration scenarios, while simultaneously reducing memory overhead and implementation complexity.
The core contribution is the mapping of the SpMV problem onto this new hardware-supported, fine-grained dataflow execution paradigm. While the components—SpMV optimization techniques and dynamic scheduling concepts—are not new in isolation, their synthesis through the Work Graphs abstraction is novel and demonstrates a fundamentally different, and more effective, way to structure this irregular workload on a GPU.
Strengths
-
Clear Novelty in Application and Synthesis: The primary strength of this paper is its novelty. While the authors correctly attribute the Work Graphs model to prior industry development (Section 2.4, page 3), its application to a canonical HPC problem like SpMV is, to my knowledge, the first of its kind in the academic literature. The novelty lies not in inventing a new SpMV algorithm from scratch, but in demonstrating how to effectively map an existing state-of-the-art approach (LRB) to a completely new, hardware-managed, fine-grained dataflow execution model.
-
Significant Delta Over Prior Art: The paper does an excellent job of differentiating its contribution from prior attempts at dynamic GPU execution.
- vs. Persistent Threads (Section 2.2, page 3): This work replaces complex, manually-managed software queues with a hardware/firmware-managed dataflow mechanism. This is a significant step forward, abstracting away a major source of complexity and overhead.
- vs. CUDA/HIP Graphs (Section 2.1, page 2): The proposed method moves beyond static, coarse-grained (kernel-level) dependency graphs. Work Graphs operate at the workgroup/thread level, allowing for dynamic, data-dependent scheduling and interleaving of different processing stages, which is impossible with traditional task graphs.
- vs. Traditional On-Device LRB (Section 3, page 5): The standard LRB implementation (visualized in Listing 1) requires multiple distinct kernel launches and global synchronizations to pass data between the binning and processing stages. The authors' "Vanilla" and "Fused" Work Graph approaches (Listing 2) collapse this into a single graph launch where work items flow dynamically between nodes. This elimination of stop-the-world synchronization points and host interaction is a key conceptual advance.
-
The "Fused" Variant as an Algorithmic Novelty: The most novel contribution is the "Fused" implementation (Section 4.2.3, page 7). This is more than a simple translation of LRB; it is an algorithmic insight enabled by the Work Graphs model. By processing short rows directly within the binning node and only dispatching records for longer, more complex rows, the authors have effectively implemented a form of dynamic, on-the-fly kernel fusion. This demonstrates a sophisticated understanding of how to leverage the new programming model to co-locate dependent computations and improve data locality, which is a genuinely new technique in the context of SpMV.
Weaknesses
-
Novelty is Tied to a Nascent, Vendor-Specific Feature: The entire contribution hinges on the availability of the "Work Graphs" feature, which is currently not part of a standardized, cross-vendor compute API like OpenCL or CUDA (it is rooted in Direct3D 12). While the work is an excellent case study of the feature's potential, its novelty is tightly coupled to the adoption and standardization of this specific hardware capability. The paper could be strengthened by a brief discussion of the path to broader availability (e.g., through Vulkan or other APIs) to better frame the long-term relevance of this novel approach.
-
Potential for Hidden Complexity in Tuning: The "Fused" kernel introduces a new decision point: the threshold at which a row is no longer considered "short" and must be passed to a downstream node. This work appears to have pre-defined thresholds for "short," "vector-wave," and "long" rows. This risks trading one form of complexity (manual scheduling) for another (heuristic tuning). The novelty of the approach would be stronger if the authors discussed how these thresholds were determined and how sensitive the performance is to their values. Is it possible for the graph itself to dynamically adapt these thresholds? That would be a truly groundbreaking contribution.
Questions to Address In Rebuttal
-
The performance of the novel "Fused" approach (Section 4.2.3, page 7) depends on the classification of rows into "short" and "long." Could the authors elaborate on how the length thresholds for this classification were determined? Are these static, hard-coded values, or can they be tuned for different architectures or matrix characteristics? How much does performance degrade if suboptimal thresholds are chosen?
-
The paper's most significant conceptual leap is arguably the "Fused" kernel, not the direct "Vanilla" translation. Could the authors expand on why this type of dynamic kernel fusion is fundamentally infeasible or inefficient with prior art like persistent threads or CUDA Dynamic Parallelism? This would further sharpen the paper's core novel contribution.
-
Given that the novelty is tied to the Work Graphs feature, what is the authors' perspective on its potential for standardization across different GPU vendors and inclusion in more common HPC programming environments beyond the Direct3D-family of APIs?
Telos: A Dataflow Accelerator for Sparse Triangular Solver of Partial Differential Equations
Abstract
Partial Differential Equations (PDEs) serve as the backbone of numerous scientific problems. Their solutions often rely on numerical methods, which transform these equations into large, sparse systems of linear equations. These systems, solved with ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose Telos, a dataflow accelerator for Sparse Triangular Solve (SpTRSV) targeting structured sparse systems derived from Partial Differential Equations (PDEs). The core contributions are a "plane-parallel pipelining" technique based on an affine coordinate transformation to create parallelizable wavefronts, and a "cross-plane communication aggregation" method to manage data dependencies between processing elements (PEs). While the paper identifies a relevant problem and proposes a dataflow-based solution, its claims of generality and superior performance are not rigorously supported. The evaluation relies on questionable modeling of competing state-of-the-art accelerators, and the proposed techniques appear to be applicable only to a narrow, highly regular class of PDE problems, a limitation that is not sufficiently acknowledged.
Strengths
- Problem Formulation: The paper correctly identifies the SpTRSV kernel as a major bottleneck in preconditioned iterative solvers for PDEs and rightly points out the challenges of parallelism and data locality on conventional architectures like GPUs (Section 3, Figure 4).
- Core Pipelining Concept: The fundamental idea of transforming stencil dependencies into a pipelined dataflow execution across a PE array (Section 4.1) is a sound and promising direction for this problem class.
- Data Reuse: The architecture is designed to maximize temporal and spatial data reuse, as analyzed in Section 5.3. This is a critical aspect for achieving high efficiency in memory-bound problems like SpTRSV.
Weaknesses
- Unsupported Generality of the Affine Transformation: The central claim of the dataflow design rests on the affine transformation (Section 4.1.1, Algorithm 2) to eliminate backward dependencies. The paper demonstrates this for the Diamond-13P stencil and asserts it "can be applied to other stencil patterns." This is an unsubstantiated claim. The algorithm appears heuristic, and no formal argument or proof is provided to guarantee its success for any arbitrary stencil pattern arising from PDE discretizations. The work would be much stronger if the authors formally defined the class of stencils for which this transformation is guaranteed to work.
- Flawed Evaluation of Competing Accelerators: The performance comparison against state-of-the-art accelerators Alrescha, FDMAX, and Spadix is fundamentally unsound. In Section 6.1, the authors state, "None of them has been open-sourced. Therefore, we model their behaviors and scale them..." This self-modeling of competing prior work is a critical methodological flaw. Without extensive validation, these models are unlikely to faithfully represent the performance of the original architectures. Consequently, the claimed speedups of 11x over Alrescha and ~8x over Spadix (Figure 12, Figure 15) are not credible.
- Overstated Problem Scope: The title and introduction refer broadly to "Partial Differential Equations." However, the entire methodology—from the affine transformation to the PE mapping—is predicated on the extreme regularity of structured Cartesian grids. Many real-world scientific applications rely on unstructured meshes or adaptive mesh refinement (AMR), which generate sparsity patterns that would break the assumptions of Telos. The paper fails to discuss these limitations, implicitly overstating its applicability.
- Misleading Use of "Systolic": The authors claim their communication aggregation technique enables "systolic-style data transfers" (Abstract, Section 8). However, the communication patterns depicted in Figure 7 are not strictly systolic. Systolic arrays are characterized by simple, regular, nearest-neighbor communication. The paths in Figure 7 (e.g., the aggregated path for x111) are complex, multi-hop, and data-dependent, requiring a more sophisticated routing network than what "systolic" implies. This terminology is misleading and exaggerates the regularity of the dataflow.
- Insufficient Justification for Architectural Parameters: The choice of 4 vector lanes per PE (Section 6.4, Table 3) is poorly justified. The authors' own analysis in Figure 18c shows that for 3 out of 4 tested stencil patterns, performance saturates at 2 vector lanes. Only the most complex stencil (Box-27P) benefits from more lanes. Choosing 4 lanes for all PEs appears to be an inefficient use of silicon area and power for the majority of workloads presented. A more rigorous analysis of the area/performance trade-off is required.
Questions to Address In Rebuttal
- Please provide a formal proof or, failing that, a rigorous argument for the correctness and completeness of the Backward Dependency Elimination (Algorithm 2). Specifically, define the exact class of stencil dependency graphs for which the algorithm is guaranteed to find a valid transformation. Are there known stencil types where it would fail?
- Provide a detailed account of your methodology for modeling the Alrescha, FDMAX, and Spadix accelerators. How were these models validated? Without this, the quantitative comparisons against these key baselines cannot be considered reliable.
- Please justify the use of the term "systolic-style" communication. Given that the communication paths are multi-hop and stencil-dependent, as shown in Figure 7, how does this align with the established definition of a systolic architecture?
- How would the Telos architecture handle the irregularities introduced by common PDE solution techniques like unstructured grids or adaptive mesh refinement? If it cannot, please explicitly state this as a primary limitation of the proposed approach.
- Explain the rationale for implementing 4 vector lanes per PE when your own data (Figure 18c) indicates that performance for most evaluated stencils does not improve beyond 2 lanes. What is the area and power overhead of the two unused lanes in these common cases, and how is this trade-off justified?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents Telos, a dataflow accelerator for the sparse triangular solve (SpTRSV) kernel, a critical bottleneck in preconditioned iterative solvers for Partial Differential Equations (PDEs). The work's core contribution is a novel methodology that leverages the structured sparsity inherent in stencil-based discretizations on regular grids. Instead of treating the sparse matrix as a generic graph, Telos translates the geometric dependency structure of the underlying stencil into a highly efficient, systolic-like dataflow execution pattern. This is achieved through two key techniques: a "plane-parallel pipelining" method, which uses affine transformations to systematically organize computations into parallel wavefronts, and a "cross-plane communication aggregation" scheme to manage the resulting data transfers between processing elements (PEs) with minimal overhead. The authors demonstrate through extensive experiments that Telos significantly outperforms CPUs, high-end GPUs, and a state-of-the-art general-purpose SpTRSV accelerator.
Strengths
-
Novel and Insightful Problem Framing: The paper carves out a compelling and important niche between two established classes of accelerators. On one hand, there are matrix-free stencil accelerators (like FDMAX), which are efficient but cannot handle the dependencies of SpTRSV. On the other hand, there are general SpTRSV accelerators (like Alrescha), which are flexible but fail to exploit the rich geometric structure of PDE-derived matrices. By focusing specifically on "structured SpTRSV," Telos provides a solution that combines the structural awareness of the former with the algorithmic power of the latter. This is a powerful synthesis of ideas from numerical methods and computer architecture.
-
Systematic and Elegant Dataflow Mapping: The use of affine transformations to resolve backward dependencies (Section 4.1.1, page 5) is the conceptual heart of this paper. It provides a principled and systematic method for converting a complex dependency graph into a regular wavefront computation that can be mapped directly onto a hardware PE array. This elevates the design from an ad-hoc solution for one stencil to a generalizable methodology, which is a significant strength. The resulting systolic-like communication pattern is a classic and highly efficient execution model that this work successfully applies to a notoriously difficult, dependency-laden problem.
-
Demonstrated Real-World Impact: The paper's evaluation is strong not only because the kernel-level speedups are impressive, but because it connects these gains back to the end-to-end application. By evaluating a full Preconditioned Conjugate Gradient (PCG) solver (Section 6.3, page 11), the authors show that accelerating SpTRSV enables the use of more powerful preconditioners. This leads to faster convergence and substantial overall time-to-solution improvements compared to accelerators focused on simpler, matrix-free methods. This clearly demonstrates that Telos is a solution to a real and significant problem in scientific computing.
-
Strong Algorithm-Architecture Co-design: The hardware architecture presented in Section 5 is clearly motivated by the dataflow design. Features like the non-zero value packing, the configurable aggregator primitive, and the Halo Exchange Units (HEUs) are not arbitrary but are direct consequences of the plane-parallel pipelining and communication aggregation schemes. This synergy between algorithm and architecture is a hallmark of excellent DSA design.
Weaknesses
-
Limited Scope (by design): The work's primary strength—its specialization for structured grids—is also its main limitation. Many cutting-edge simulations in fields like fluid dynamics and structural mechanics rely on unstructured grids or Adaptive Mesh Refinement (AMR) to efficiently model complex geometries. The paper would be strengthened by a brief discussion contextualizing its contribution, acknowledging this limitation, and perhaps speculating on the challenges or potential pathways to extend these geometric dataflow principles to more irregular problem domains.
-
Clarity on Generalizability: While the affine transformation is presented as a general method, its application could be further illuminated. For a researcher looking to apply this method to a new or more exotic stencil pattern, it is not immediately clear how the transformation matrix
T is derived or whether the process in Algorithm 2 (page 6) is guaranteed to find a valid mapping. A more detailed example or a discussion on the potential for automating this transformation via a compiler would bolster the claim of a truly systematic methodology.
Questions to Address In Rebuttal
-
The affine transformation presented in Algorithm 2 is the cornerstone of your method. Could you please elaborate on its robustness and scope? Does this procedure provably find a valid parallelizable wavefront mapping for any stencil pattern that results in a solvable triangular system, or are there classes of stencils (e.g., those with very long-range or asymmetric dependencies) for which it might fail? Is this process currently automated for the stencils you tested?
-
Your work masterfully exploits the regularity of structured grids. From your perspective, what are the fundamental obstacles to applying the core principles of Telos (i.e., mapping local geometric dependencies to a systolic dataflow) to problems with irregular structures, such as those using unstructured finite element meshes? Is there a conceptual path forward, perhaps by operating on patches of structured sub-grids, or do you view this as a fundamentally structured-grid solution?
-
The Halo Exchange Units (HEUs) appear to handle inter-tile dependencies by writing updated boundary values back to main memory, which are then read in a subsequent execution phase (Section 5.3, page 9). This seems to introduce a memory round-trip that breaks the otherwise pure on-chip dataflow at tile boundaries. Could you quantify the performance impact of this design choice compared to a hypothetical on-chip-only exchange mechanism (e.g., the long FIFOs you mention and dismiss in Section 5.2, page 9)? A more concrete justification would help clarify the trade-offs involved in your inter-tile pipelining strategy.
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper proposes Telos, a dataflow accelerator for the Sparse Triangular Solve (SpTRSV) kernel, specifically targeting structured sparse matrices derived from Partial Differential Equation (PDE) discretizations. The authors identify that standard SpTRSV parallelization methods struggle to balance parallelism and data locality. The core of their proposed solution rests on two main ideas: (1) a "plane-parallel pipelining" technique, which uses an affine coordinate transformation to organize computations into parallel wavefronts, and (2) a "cross-plane communication aggregation" technique, which transforms the resulting data dependencies into a systolic, nearest-neighbor communication pattern. The paper presents a corresponding hardware architecture and demonstrates significant speedups over CPU, GPU, and a state-of-the-art SpTRSV accelerator.
Strengths
The primary strength of this work is the elegant and systematic synthesis of established parallel computing principles into a cohesive, specialized hardware architecture for a very specific and challenging problem domain. The authors have correctly identified that the structure inherent in PDE stencils is an underexploited resource for SpTRSV acceleration. The translation of a stencil's dependency pattern into a hardware-realizable systolic dataflow is well-conceived and thoroughly detailed from algorithm to microarchitecture.
Weaknesses
My evaluation is focused exclusively on the novelty of the core ideas. While the engineering and integration are commendable, the foundational concepts presented as novel in this paper have deep roots in prior art, specifically in the fields of parallelizing compilers and systolic array design.
-
"Plane-Parallel Pipelining" is Wavefront Parallelization via Loop Transformation: The central idea of partitioning the computation domain into hyperplanes (or wavefronts) where all nodes can be processed in parallel is a classic technique for parallelizing algorithms with uniform data dependencies, dating back to the late 1970s and early 1980s. This is the foundational principle behind systolic arrays and various loop parallelization strategies.
-
The Affine Transformation is a Known Polyhedral Compilation Technique: The method used to handle "backward dependencies" (Section 4.1.1, Page 5) is, in essence, a loop skewing transformation. The process of applying an affine transformation to a coordinate system to expose parallelism in a loop nest is the cornerstone of the polyhedral model of compilation. Decades of research in compilers have focused on automatically finding such transformations to enable parallel execution. The paper does not acknowledge this vast body of work, nor does it sufficiently differentiate its proposed transformation algorithm (Algorithm 2, Page 6) from standard polyhedral scheduling algorithms. The claim of novelty for this technique is therefore significantly overstated. (For reference, a minimal sketch of the standard wavefront schedule at issue appears at the end of this Weaknesses section.)
-
"Cross-Plane Communication Aggregation" is Systolic Data Routing: The concept of transforming longer-range data dependencies into a nearest-neighbor, pipelined communication fabric is the very definition of a systolic array, as pioneered by H.T. Kung. The proposed "aggregation" technique (Section 4.2, Page 6) is a specific implementation of this principle, where partial products are accumulated as they flow through the PE array. While Algorithm 3 (Page 7) provides a concrete method for assigning these paths, the underlying concept is not new. It is a direct application of systolic principles to the SpTRSV recurrence.
In summary, the novelty of this paper does not lie in the invention of new parallelization concepts, but rather in the application-specific specialization and hardware co-design of these well-known concepts for structured SpTRSV. The paper's framing, however, presents these foundational techniques as novel contributions in themselves, which is a major weakness. The true "delta" over prior art is the creation of a systematic methodology to map any PDE stencil pattern onto a fixed systolic hardware template, which is a valuable engineering contribution but not a fundamental conceptual breakthrough.
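To make the prior-art framing above concrete, the following is a minimal sketch, written by this reviewer and not taken from the paper, of the classic hyperplane/wavefront schedule for a lower-triangular solve with nearest-neighbor dependencies on a 2D grid. The grid size, coefficients, and function names are invented for illustration; the affine schedule is simply theta(i, j) = i + j, the kind of mapping that loop skewing and polyhedral schedulers derive automatically.

```python
# Illustrative only: classic hyperplane/wavefront scheduling for a 2D
# lower-triangular stencil solve, where x[i][j] depends on x[i-1][j] and
# x[i][j-1]. All cells on the hyperplane i + j = w are mutually independent
# and can be processed in parallel; this is the well-known prior art the
# review refers to, not the paper's Algorithm 2.
from collections import defaultdict

def wavefronts(n):
    """Group the cells of an n x n domain into parallel wavefronts."""
    planes = defaultdict(list)
    for i in range(n):
        for j in range(n):
            planes[i + j].append((i, j))  # affine schedule theta(i, j) = i + j
    return [planes[w] for w in sorted(planes)]

def solve(n, b, L_diag=1.0, L_west=0.5, L_south=0.5):
    """Toy forward substitution in wavefront order (coefficients are hypothetical)."""
    x = [[0.0] * n for _ in range(n)]
    for plane in wavefronts(n):          # sequential across planes
        for (i, j) in plane:             # parallelizable within a plane
            acc = b[i][j]
            if i > 0:
                acc -= L_south * x[i - 1][j]
            if j > 0:
                acc -= L_west * x[i][j - 1]
            x[i][j] = acc / L_diag
    return x

if __name__ == "__main__":
    n = 4
    b = [[1.0] * n for _ in range(n)]
    print([len(p) for p in wavefronts(n)])  # plane sizes: 1, 2, 3, 4, 3, 2, 1
    print(solve(n, b)[n - 1][n - 1])
```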
Questions to Address In Rebuttal
The authors must clarify the precise novelty of their work with respect to the extensive prior art in parallel algorithms and compilers.
- Please clarify the novelty of the "plane-parallel pipelining" technique with respect to the decades of research on wavefront/hyperplane parallelization for algorithms with regular dependencies.
- The affine transformation presented in Section 4.1.1 and Algorithm 2 appears to be a specialized application of techniques from the polyhedral compilation domain. Could the same result (i.e., a valid parallel schedule) be derived using an existing polyhedral tool like Pluto or ISL? If so, what is the novel contribution of your specific transformation algorithm beyond what is already established in the compiler literature?
- The goal of creating systolic, nearest-neighbor communication is a well-established design pattern for dataflow accelerators. Is the novelty of the "cross-plane communication aggregation" primarily in the specific path-finding heuristic of Algorithm 3? Please elaborate on why this is a non-trivial contribution beyond a direct mapping of the SpTRSV dependency graph onto a 2D mesh topology.
MagiCache: A Virtual In-Cache Computing Engine
Abstract
The rise of data-parallel applications poses a significant challenge to the energy consumption of computing architectures. In-cache computation is a promising solution for achieving high parallelism and energy efficiency because it can eliminate data ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present MagiCache, an in-cache computing (IMC) architecture aiming to resolve two known issues in existing array-level IMC designs: static resource partitioning leading to cache under-utilization, and high latency from bursty data movement. The proposed solution has three main components: (1) a "cacheline-level" architecture where individual cache rows can be dynamically configured for either storage or computation; (2) a "virtual engine" to manage this dynamic allocation at runtime; and (3) an "instruction chaining" technique to overlap memory access and computation by allowing asynchronous execution across different cache arrays.
While the proposed techniques appear promising on the surface, the evaluation rests on several questionable assumptions and lacks the necessary rigor to fully substantiate its performance claims. The paper fails to adequately justify key design choices, provides insufficient evidence for critical assertions, and presents a performance comparison against a potentially weak baseline.
Strengths
-
The core motivation is well-founded. The paper correctly identifies the under-utilization of statically partitioned computing and storage space as a key limitation of prior art (Section 3.1). Figure 2 effectively demonstrates that a one-size-fits-all static partitioning is suboptimal across different applications.
-
The analysis of cache utilization for multi-application workloads (Section 6.2) is compelling. The data presented in Table 8 and visualized in Figure 11 provide clear evidence that the dynamic allocation scheme significantly improves effective cache capacity compared to a static 50/50 split, leading to lower miss rates for co-running scalar applications.
Weaknesses
-
Unsubstantiated Core Architectural Claims: The entire premise of dynamic allocation hinges on an efficient and low-overhead policy for converting cachelines to computing lines. The authors propose a Find-First-Available (FFA) policy and claim it "incurs less than 1% increase in the overall L2 miss rate" (Section 4.3, page 7). This is a critical claim, as a poor allocation policy could induce significant conflict misses, negating the benefits of increased capacity. However, there is no empirical data, table, or figure in the Evaluation section (Section 6) that isolates and validates this specific claim. It is presented as fact without evidence. (A sketch of what such a policy implies is given after this list.)
-
Insufficient Analysis of Instruction Chaining Limitations: The paper acknowledges that the instruction chaining technique is not universally applicable, citing that configuration, permutation (slide), and certain store instructions break the chain (Section 4.4, page 8). The results confirm this, showing minimal benefit for jacobi and pathfinder. However, the analysis stops there. A rigorous evaluation would require a characterization of how frequently such chain-breaking instructions occur across a wide range of vectorizable workloads. Without this, the generalizability of the instruction chaining benefit is unknown. The chosen benchmark suite may be biased towards applications with long, unbroken chains of compatible instructions.
-
Oversimplified and Potentially Unfair Baseline: The baseline, "SplitCache," implements a fixed 50/50 split of cache ways for computing and storage. While this represents a generic static scheme, the authors' own motivation in Figure 2 shows that different applications have different optimal static splits (e.g., 62.5% for matmul, 50% for backprop). A much stronger and more honest evaluation would compare MagiCache not just against a naive 50/50 split, but against an oracle static split for each benchmark. This would isolate the true benefit of dynamic runtime allocation from the benefit of simply choosing a better static configuration. The reported speedups may be artificially inflated by comparing against a suboptimal baseline configuration.
-
Incomplete Analysis of Performance Trade-offs: The results clearly show that as vector length increases (from Chain-1 to Chain-4), MSHR stall time also increases (Figure 9), and average MSHR usage rises significantly (Table 7). The authors dismiss this by stating that the reduction in synchronization time is greater. This conclusion, however, is contingent on the availability of sufficient MSHR resources. The evaluation is conducted with a fixed 32 MSHRs. The architecture's performance could collapse in a system with fewer MSHRs, a common configuration in more area-constrained designs. A sensitivity analysis to MSHR count is required to understand the robustness of these performance claims.
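To clarify the concern in weakness 1 (referenced there), here is a hypothetical sketch of what a find-first-available conversion policy implies. The class and method names, the per-set granularity, and the eviction accounting are all this reviewer's assumptions for exposition, not the paper's mechanism; the point is only that the first non-computing line claimed may hold valid cached data, which is exactly the source of extra misses the quoted claim would need to rule out.

```python
# Hypothetical illustration of a Find-First-Available (FFA) style policy for
# converting cachelines into computing lines. Layout and naming are assumed;
# only the policy shape matters: the first line not already used for compute
# is claimed, and any valid cached data it held is evicted.

class CacheSet:
    def __init__(self, num_ways):
        # Each way is either ("storage", tag_or_None) or ("compute", vreg_id).
        self.ways = [("storage", None) for _ in range(num_ways)]
        self.extra_misses = 0

    def ffa_convert(self, vreg_id):
        """Convert the first available storage line in this set to a compute line."""
        for idx, (kind, payload) in enumerate(self.ways):
            if kind == "storage":
                if payload is not None:
                    self.extra_misses += 1  # valid block evicted -> possible future miss
                self.ways[idx] = ("compute", vreg_id)
                return idx
        return None  # set fully occupied by compute lines; must stall or spill

if __name__ == "__main__":
    s = CacheSet(num_ways=8)
    s.ways[0] = ("storage", 0x1A2B)     # pretend one way holds valid cached data
    print(s.ffa_convert(vreg_id=3))      # claims way 0, evicting the cached block
    print(s.extra_misses)                # 1
```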
Questions to Address In Rebuttal
-
Please provide direct empirical evidence from your simulation framework to support the claim on page 7 that the FFA allocation policy "incurs less than 1% increase in the overall L2 miss rate" compared to standard LRU on a conventional cache of equivalent available capacity.
-
Please justify the selection of a fixed 50/50 split for the SplitCache baseline. How does MagiCache's performance compare to an oracle static split baseline, where SplitCache is configured with the optimal computing/storage ratio for each specific benchmark (as suggested by your analysis in Figure 2)?
-
Can the authors provide a quantitative analysis of the frequency of chain-breaking instructions (i.e., vsetvli, permutations, conflicting stores) for the evaluated benchmarks? Furthermore, can you comment on how prevalent these instructions are in broader vector workloads (e.g., SPEC, other HPC kernels) to better contextualize the general applicability of the instruction chaining technique?
-
The results in Figure 9 and Table 7 demonstrate a clear trade-off between reduced synchronization stalls and increased MSHR pressure. Please provide a sensitivity analysis showing how the speedup of MagiCache (e.g., Chain-4) and the baseline scale with a varying number of L2 MSHRs (e.g., 16, 32, 64).
-
Please clarify the physical organization of the L2 cache. The description on page 9 states "1024 sets and 8 ways, with each way including eight 256x256 fused arrays." This implies 1024 * 8 * 8 = 65,536 fused arrays, which seems inconsistent with the total L2 size and other architectural descriptions. Please provide a clear, consistent description of the mapping from ways/sets to the fused arrays (a back-of-the-envelope capacity check follows this list).
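To make the inconsistency behind question 5 explicit, the back-of-the-envelope check below contrasts the capacity implied by the set/way description with the capacity implied by 65,536 fused arrays. The 64 B line size and the one-bit-per-cell assumption for a 256x256 array are this reviewer's assumptions, not figures from the paper.

```python
# Back-of-the-envelope capacity check for question 5. Assumptions (not from
# the paper): 64 B cachelines and 256x256 one-bit SRAM cells per fused array.
SETS, WAYS, LINE_BYTES = 1024, 8, 64
capacity_from_sets = SETS * WAYS * LINE_BYTES            # 524,288 B = 512 KB

ARRAY_BITS = 256 * 256                                   # 65,536 bits = 8 KB per array
arrays_if_per_set = SETS * WAYS * 8                      # 65,536 arrays (the reading quoted above)
arrays_if_per_way = WAYS * 8                             # 64 arrays (an alternative reading)

print(capacity_from_sets // 1024, "KB from sets x ways x line size")        # 512 KB
print(arrays_if_per_set * ARRAY_BITS // 8 // 2**20, "MB if 65,536 arrays")  # 512 MB
print(arrays_if_per_way * ARRAY_BITS // 8 // 1024, "KB if 64 arrays")       # 512 KB
```

The two readings differ by a factor of 1024, which is exactly the ambiguity the question asks the authors to resolve.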
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces "MagiCache," a novel architecture for in-cache computing that addresses the critical problem of resource under-utilization inherent in current static, array-level designs. The core contribution is the introduction of a virtualization layer for in-cache compute resources. This is achieved through two primary innovations: 1) A fine-grained, cacheline-level architecture where individual cache rows can be dynamically configured as either storage cachelines or compute lines, creating "fused arrays." 2) A "virtual engine" that manages this dynamic allocation at runtime, mapping logical vector registers to physical cache rows based on application needs. This effectively transforms the cache from a statically partitioned resource into a flexible, unified pool of storage and compute. To mitigate the latency of bursty memory accesses that high-parallelism exposes, the authors also propose an "instruction chaining" technique that enables asynchronous execution across different physical arrays. The evaluation, conducted via gem5 simulation and circuit analysis, demonstrates significant speedups (1.19x-1.61x) and a dramatic improvement in cache utilization (42%) over a state-of-the-art static baseline.
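As a reading aid for the mechanism summarized above, the following is a minimal, hypothetical model of a virtual-register mapping table that resolves a logical vector-register segment to a physical (array, row) slot and allocates lazily on first use. The class, method names, and sizing are this reviewer's assumptions, not the paper's interface; the sketch only captures the essence of the virtualization layer being described.

```python
# Minimal, hypothetical model of the mapping layer described in the summary:
# logical vector-register segments are resolved to physical (array, row) slots
# on demand, and freed rows return to the shared storage/compute pool.

class VirtualEngine:
    def __init__(self, num_arrays, rows_per_array):
        self.free_rows = [(a, r) for a in range(num_arrays)
                                 for r in range(rows_per_array)]
        self.vrmt = {}  # (vreg, segment) -> (array, row)

    def resolve(self, vreg, segment):
        """Return the physical row backing a register segment, allocating lazily."""
        key = (vreg, segment)
        if key not in self.vrmt:
            if not self.free_rows:
                raise RuntimeError("no free rows: storage lines would be reclaimed here")
            self.vrmt[key] = self.free_rows.pop(0)  # first available physical row
        return self.vrmt[key]

    def release(self, vreg):
        """Return all rows of a dead register to the shared pool."""
        for key in [k for k in self.vrmt if k[0] == vreg]:
            self.free_rows.append(self.vrmt.pop(key))

if __name__ == "__main__":
    ve = VirtualEngine(num_arrays=4, rows_per_array=8)
    print(ve.resolve(vreg=2, segment=0))  # (0, 0) allocated on first use
    print(ve.resolve(vreg=2, segment=1))  # (0, 1)
    ve.release(vreg=2)                    # rows become cache-usable again
    print(len(ve.free_rows))              # 32
```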
Strengths
-
Conceptual Leap in Resource Management: The most significant strength of this work is its departure from the prevailing paradigm of static, coarse-grained resource partitioning in the in-cache computing space (e.g., EVE, Duality Cache). The introduction of a dynamic, cacheline-level management scheme is a conceptual leap. It applies the well-established principle of virtualization—decoupling the logical view of resources from their physical implementation—to the cache substrate itself. This is a powerful and elegant solution to the well-motivated problem of mismatch between fixed hardware configurations and variable application demands (as shown effectively in Figure 2, page 4).
-
Strong and Clear Motivation: The authors do an excellent job in Section 3 (page 3) of motivating their work. They clearly articulate the dual problems of cache space under-utilization and bursty data movement overhead. The analysis and diagrams (Figures 2 and 3) provide a compelling and intuitive justification for why a finer-grained, dynamic approach is necessary. This clear problem statement makes the proposed solution feel not just novel, but necessary for the advancement of the field.
-
Holistic and Well-Considered Design: The proposed MagiCache is not just a high-level idea; it is a well-considered architecture. The authors have thought through the necessary components, from the low-level hardware modifications (additional tag bits in Section 4.2), to the central management unit (the virtual engine and its VRMT in Section 4.3), to a crucial performance optimization (instruction chaining in Section 4.4). Furthermore, the brief but important discussion on OS integration (Section 4.6) shows a commendable level of system-level thinking, recognizing that such a profound change to the memory hierarchy requires co-design with system software.
-
Enabling Generality for In-Cache Computing: By allowing the trade-off between compute and storage to be made dynamically, MagiCache moves in-cache computing closer to a general-purpose paradigm. Previous architectures essentially carved out a fixed-size, specialized accelerator within the cache. MagiCache proposes a more fluid and adaptable substrate, which could significantly broaden the applicability and efficiency of in-cache computing for workloads with diverse or phase-changing behavior.
Weaknesses
While the core idea is compelling, the paper could benefit from a deeper exploration of the complexities introduced by this new layer of virtualization.
-
Management Overhead and Complexity: The "virtual engine" is the heart of the system, but its own performance impact and complexity are not fully explored. The paper quantifies the storage overhead of the VRMT, but not the latency overhead of the management logic itself. For instance, the FFA allocation policy requires a scan; while likely not on the critical path for a single instruction, the cumulative effect and potential for resource contention in the engine itself, especially with many concurrent vector register allocations/deallocations, is a potential concern.
-
Interaction with the Cache Coherence Protocol: Section 4.5 (page 8) touches on coherence but is quite brief. Dynamically changing the function of a cacheline from storage (and thus part of the coherence domain) to compute (managed by the virtual engine) seems fraught with complex corner cases. For example, what happens during the multi-cycle conversion process if an invalidation request arrives for that address? This dynamic interaction could introduce subtle race conditions or require complex locking mechanisms that are not discussed.
-
Scalability of the Virtual Register Mapping Table (VRMT): The VRMT's size is proportional to the number of architectural registers and the number of segments per register. As vector ISAs evolve towards even longer vectors (e.g., ARM SVE2, RISC-V V-extension with larger VLEN), this centralized table could become a bottleneck or a significant source of area/power overhead. The paper presents a viable design for the current RISC-V vector specification, but a discussion of its scalability to future architectural trends would strengthen the work.
Questions to Address In Rebuttal
-
Could you elaborate on the latency of the virtual engine's operations? Specifically, what is the cycle cost of a VRMT lookup and the find-first-available (FFA) allocation scan? How do you ensure that this management logic does not become a new bottleneck, particularly for applications with many short-lived vector operations?
-
Can you provide more detail on the handling of transient states during the conversion of a cacheline to a computing line? How does the coherence protocol interact with a line that is "in-flight" during this conversion? Are there additional states required in the coherence protocol to prevent races between scalar core requests and the virtual engine's management operations?
-
While the instruction chaining technique effectively hides latency between different fused arrays, what is the performance impact of workloads that exhibit poor locality within a single fused array's segment of a vector register? Does the bursty access problem simply reappear at a smaller granularity?
-
Thinking broadly, your work introduces a form of dynamic resource partitioning within the cache, a concept that shares philosophical similarities with reconfigurable computing. Could you comment on how MagiCache relates to this field and whether techniques from runtime reconfigurable hardware management could be applied to further enhance the virtual engine?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors propose MagiCache, an in-cache computing (IMC) engine that introduces a layer of virtualization over the physical cache hardware. The paper's contribution rests on three core claims of novelty: 1. Cacheline-Level Architecture: A novel hardware architecture where individual rows within a single SRAM array can be dynamically configured as either storage cachelines or compute lines, creating "fused arrays." This contrasts with prior art that partitions the cache at the coarser granularity of entire SRAM arrays. 2. Virtual Engine: A management layer that dynamically allocates and maps these cacheline-level resources to architectural vector registers at runtime. This allows for flexible vector lengths and high utilization, as unused vector register space can be reclaimed for general-purpose caching. 3. Instruction Chaining: A hardware technique that enables asynchronous execution of instructions across different fused arrays to overlap computation with memory access latency, specifically targeting the bursty access patterns of IMC workloads.
The central thesis is that this fine-grained, dynamic management of compute and storage resources within the cache hierarchy is more efficient than the static, array-level partitioning common in existing IMC architectures.
Strengths
The primary strength of this paper is the genuine novelty of its core architectural idea.
-
Novelty of Cacheline-Level Partitioning: The proposal to partition compute and storage resources at the cacheline level within a single SRAM array is a significant conceptual advance. To my knowledge, the dominant paradigm in prior work, such as EVE [3] and Duality Cache [15], involves dedicating entire SRAM arrays (or cache ways) to computation. This forces a static, coarse-grained trade-off between compute parallelism and cache capacity before runtime. MagiCache's "fused array" concept (Section 4.2, page 5), enabled by adding indicator bits to cache tags, dissolves this rigid boundary. This is a fundamentally new way to structure an in-cache computing substrate.
-
Novelty of the Virtual Engine's Dynamic Management: The virtual engine (Section 4.3, page 6) is a logical and novel consequence of the new architecture. While the idea of a mapping table is not new, its application to manage a pool of fungible cacheline-level compute/storage resources is. This contrasts sharply with the static mapping in EVE [3] and provides a much more flexible abstraction than the coarse-grained, power-of-two vector register grouping found in the RISC-V V extension [32] or ARM SVE [38], a point the authors correctly identify in their Related Work (Section 7, page 12). The lazy initialization scheme is a particularly clever implementation detail that directly addresses the register under-utilization problem.
-
Contextual Novelty of Instruction Chaining: While the term "chaining" evokes classic vector supercomputers, its implementation here is novel in context. Prior IMC work like Duality Cache [15] sought to overlap operations using a software-based VLIW approach, which places a heavy burden on the compiler. MagiCache proposes a hardware-driven, asynchronous execution flow (Section 4.4, page 7) that is more transparent to the compiler. The novelty lies in adapting the concept of chaining to this specific microarchitecture of distributed, independent "fused arrays" to hide memory latency.
Weaknesses
While the core ideas are novel, the paper could strengthen its claims by more deeply situating some of its secondary contributions against prior art and exploring the implications of its design choices.
-
Insufficient Differentiation of Instruction Chaining from Classic Vector Chaining: The paper claims instruction chaining as a contribution but does not provide a detailed comparison to the mechanisms used in seminal vector architectures (e.g., Cray-1). The core concept—forwarding a result from one functional unit to another before it is written back to the register file to start the next operation—is functionally analogous. The novelty here is in the implementation for a spatially distributed, cache-based accelerator. The authors should be more precise about what specific hardware structures make their implementation unique beyond the simple fact that it operates across fused arrays.
-
Potential Downsides of Fine-Grained Allocation are Unexplored: The paper convincingly argues for the benefits of dynamic, cacheline-level allocation (i.e., improved utilization). However, it does not discuss potential negative second-order effects. For example, could this fine-grained allocation lead to a form of "fragmentation" where the cachelines allocated to a single vector register are physically scattered across an array? This could potentially complicate the design of peripheral circuits (e.g., carry chains in the add layer) or introduce timing variations that are not accounted for. The novelty of the approach warrants a discussion of its potential physical design challenges.
-
The "Virtual" Moniker: The term "virtual engine" is used to describe the resource management layer. In computer architecture, "virtual" typically implies address translation (e.g., virtual memory). The mechanism here is more accurately described as dynamic resource mapping and indirection. While this may seem pedantic, for a paper whose contribution is a new abstraction layer, precision in terminology is paramount. The authors should clarify why they chose this term over others like a "Dynamic Resource Manager" or "Cacheline Mapping Engine."
Questions to Address In Rebuttal
-
Can the authors please elaborate on the specific hardware novelty of their instruction chaining technique compared to the chaining mechanisms implemented in classic vector processors? What are the key differences in the issue logic, dependency checking, and inter-array communication that make this a new contribution beyond a simple re-application of an old idea?
-
The novel cacheline-level allocation policy could lead to non-contiguous physical layouts for the segments of a single vector register within a fused array. Have the authors considered the potential performance or hardware complexity implications of this physical fragmentation? For instance, does it impact the performance of operations requiring communication across bit-lines, like wide additions?
-
Could you justify the choice of the term "virtual engine"? How does the proposed mechanism align with or differ from other established uses of "virtualization" in computer architecture, such as virtual memory or virtual machines?
NMP-PaK: Near-Memory Processing Acceleration of Scalable De Novo Genome Assembly
Abstract
De novo assembly enables investigations of unknown genomes, paving the way for personalized medicine and disease management. However, it faces immense computational challenges arising from the excessive data volumes and algorithmic complexity. While state-...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose NMP-PaK, a hardware-software co-designed system for accelerating de novo genome assembly using near-memory processing (NMP). The work identifies the "Iterative Compaction" step in the state-of-the-art PaKman assembler as a memory-bound bottleneck. The proposed solution includes a channel-level NMP architecture with custom pipelined processing elements (PEs), a hybrid CPU-NMP execution model, and software-level batching to reduce the memory footprint. While the paper claims significant performance improvements (16x over CPU, 5.7x over GPU) and a 14x memory reduction, these claims are predicated on a flawed evaluation methodology, the use of unrepresentative baselines, and a fundamental compromise in the quality of the scientific result (genome assembly contiguity). The conflation of substantial software optimizations with hardware gains further obscures the true contribution of the proposed NMP architecture.
Strengths
- The initial performance analysis of the PaKman algorithm in Section 3 is detailed, successfully identifying memory latency, bandwidth underutilization, and workload imbalance as key issues. The breakdown of runtime (Figure 4) and stall cycles (Figure 5) provides a clear, albeit limited, motivation for exploring architectural solutions.
- The concept of a hardware-software co-design that explicitly targets the irregular data structures and access patterns of genome assembly is appropriate. The hybrid CPU-NMP model to handle outlier data (large MacroNodes) is a pragmatic, if not fully justified, design choice.
Weaknesses
-
Conflation of Software and Hardware Contributions: The paper's primary weakness is its failure to cleanly separate the performance gains from its novel NMP architecture versus its massive, preliminary software re-engineering. Section 4.5 describes optimizations to the baseline PaKman algorithm that yield up to a 416x speedup on k-mer counting and a 110x speedup overall. The paper then presents its main 16x NMP-PaK speedup relative to this already highly-optimized software baseline. This framing is misleading. The true, isolated contribution of the NMP hardware appears to be closer to ~6.2x (the ratio of NMP-PaK to CPU-PaK performance in Figure 11), a significant but far less dramatic figure than the one emphasized in the abstract and conclusion. (A numerical restatement of this decomposition follows this list.)
-
Unfair and Misleading Baselines: The comparisons presented are not rigorous.
- CPU Baseline: The "CPU baseline" is the authors' own modified and heavily optimized version of PaKman, not the original, published algorithm. Any comparison to a custom, non-standard baseline makes the results difficult to reproduce or place in the context of existing work.
- GPU Baseline: The GPU comparison is a classic strawman argument. The authors deliberately cripple the GPU by constraining the problem size to fit within a single GPU's memory (Section 6.6). This forces the use of a <4% batch size, which, according to their own data in Table 1, results in abysmal assembly quality (N50 of 1,209). A valid comparison would require a multi-GPU or out-of-core implementation that can handle the full problem without sacrificing scientific accuracy. Claiming a 5.7x speedup over a configuration that produces a useless result is scientifically unsound.
-
Compromised Scientific Output (Assembly Quality): The core premise of the single-node NMP-PaK system relies on batch processing to manage memory footprint (Section 4.4). This design choice fundamentally compromises the quality of the genome assembly. Table 1 clearly shows a degradation in contig quality (N50) as batch size decreases. The paper fails to provide the N50 value for the original distributed PaKman, which is the true state-of-the-art result they should be compared against. Without this crucial data point, the claim that NMP-PaK "conducts the same genome assembly" (Abstract) is unsubstantiated. It is highly probable that their N50 of 3,535 (for a 10% batch) is significantly worse than what is achievable with the original algorithm, meaning they have accelerated an inferior computation.
-
Arbitrary Design Choices and Insufficient Justification:
- The 1KB threshold for offloading large MacroNodes to the CPU (Section 4.3) is presented without rigorous justification. How was this value chosen? A sensitivity analysis is required to demonstrate that this is an optimal and robust choice, rather than a parameter fine-tuned for a single dataset.
- The paper glosses over the complexity of the synchronization required for the hybrid CPU-NMP model. It states the runtime system "guarantees" synchronization (Section 4.3), but provides no detail on the mechanism or its performance overhead, which could be substantial.
- The inter-DIMM network is critical, handling 87.5% of communication (Section 6.3), yet its design and simulation model are minimally described. This is a major component of the system, and its feasibility and performance characteristics are central to the paper's claims.
-
Questionable Simulation Fidelity: The performance results are derived from the Ramulator simulator with custom models for the PEs and network logic (Section 5.2). The paper states these are based on an "RTL design" but provides no details on this design, its verification, or the methodology used to derive cycle counts. Without this information, the accuracy of the simulation results cannot be verified.
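To restate weakness 1 numerically, the small calculation below uses only the figures quoted in this review and assumes the 16x headline and the ~6.2x NMP-PaK/CPU-PaK ratio are measured on comparable workloads. It is an illustration of the breakdown requested in question 1 below, not a result; the authors should confirm or correct the individual factors.

```python
# Illustrative decomposition of the speedup claims discussed in weakness 1.
# Figures are the ones quoted in this review; the breakdown is multiplicative:
# headline gain = (CPU-PaK gain over the CPU baseline) x (NMP-PaK gain over CPU-PaK).
nmp_over_cpu_baseline = 16.0   # headline claim: NMP-PaK vs. the authors' CPU baseline
nmp_over_cpu_pak      = 6.2    # approximate ratio of NMP-PaK to CPU-PaK (Figure 11)

cpu_pak_factor = nmp_over_cpu_baseline / nmp_over_cpu_pak   # ~2.6x from process-flow changes
print(f"implied CPU-PaK gain over the CPU baseline: {cpu_pak_factor:.1f}x")
print(f"isolated NMP hardware gain over CPU-PaK:    {nmp_over_cpu_pak:.1f}x")
print(f"product (headline figure):                  {cpu_pak_factor * nmp_over_cpu_pak:.0f}x")
```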
Questions to Address In Rebuttal
-
Please provide a clear, end-to-end performance breakdown against the original, unmodified PaKman algorithm running on a single, powerful CPU node. Clearly delineate the speedup contributions from: (a) your parallelization of the original software (Section 4.5), (b) your process flow modifications (represented by CPU-PaK), and (c) the NMP hardware itself (the delta between CPU-PaK and NMP-PaK).
-
How can the GPU baseline be considered a fair comparison when it is forced into a configuration that produces a scientifically unacceptable result (N50 < 1,300)? Please provide a comparison against a state-of-the-art GPU assembly approach that can process the 10% dataset without this quality degradation, such as a multi-GPU implementation.
-
What is the N50 value and misassembly rate of the original, distributed PaKman assembler for the full 100% human genome dataset? How does this compare to the quality achieved by your NMP-PaK system using its batching strategy scaled to the full genome? Please justify the claim that you are performing the "same" assembly.
-
Provide a sensitivity analysis of the 1KB MacroNode size threshold for CPU offload. Show how overall system performance and the CPU-NMP workload balance are affected when this threshold is varied (e.g., 512B, 2KB, 4KB).
-
Please provide concrete details on the latency, bandwidth, and contention model used for the inter-DIMM network bridge in your simulation. Given that this network handles 87.5% of TransferNode communication, its model is critical to the credibility of your results. How was this model validated?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
The authors present NMP-PaK, a hardware-software co-designed system for accelerating de novo genome assembly. The work's core contribution is not merely the application of near-memory processing (NMP) to genomics, but a deeply integrated solution tailored to a specific, critical bottleneck in a modern assembly algorithm. By identifying the "Iterative Compaction" step within the state-of-the-art PaKman assembler as the primary performance limiter (Section 3.2, page 4), the authors design a channel-level NMP architecture featuring pipelined systolic processing elements (PEs) to address its memory-bound nature and irregular access patterns. This hardware is complemented by a pragmatic software stack that uses batch processing to dramatically reduce the memory footprint and a hybrid CPU-NMP execution model to handle outlier data structures, thereby avoiding the costs of over-provisioning hardware for rare cases. The result is a system that claims significant improvements in performance, memory efficiency, and overall computational throughput compared to CPU, GPU, and even large-scale supercomputer-based approaches.
Strengths
This paper is an excellent example of architecture research driven by deep application-level analysis. Its primary strengths lie in its contextual awareness and the elegance of its co-designed solution.
-
Targeted Problem Identification: The authors have done a superb job of moving beyond a generic "genomics is slow" motivation. The detailed profiling in Section 3 (pages 3-5) that pinpoints Iterative Compaction and characterizes its stalls, memory bandwidth underutilization, and irregular data structures is the paper's analytical foundation. This allows them to design a solution for a concrete problem rather than a vague one.
-
Principled Architectural Choices: The choice of channel-level NMP is well-justified. The paper correctly argues that the size of the core data structure (
MacroNode) and the need for inter-node communication make bank-level or pure in-situ PIM less suitable. The pipelined systolic PE design (Section 4.2, page 6) is a natural fit for the sequential dependencies within the compaction process for a single node, while the array of PEs provides the necessary node-level parallelism.
-
Pragmatic Co-Design: The most impressive aspect of this work is the synergy between the hardware and software. The hybrid CPU-NMP model (Section 4.3, page 7) is a particularly insightful design choice. It recognizes that the skewed distribution of MacroNode sizes (Figure 6, page 5) makes a one-size-fits-all hardware accelerator impractical. Offloading the few large nodes to the flexible CPU while the NMP engine handles the vast majority of smaller nodes is an elegant and efficient compromise. Similarly, the use of batch processing to trade a manageable amount of assembly quality (N50) for a massive (14x) reduction in memory footprint is a key system-level insight that makes the approach viable on realistic hardware.
-
Excellent Contextual Framing: The paper situates itself perfectly at the intersection of several important research trends: the push for domain-specific architectures, the rise of near-memory processing to overcome the memory wall, and the ever-present grand challenge of scalable genomic analysis. The comparison with a supercomputer in terms of throughput (Section 6.4, page 11) is particularly effective, reframing the goal from "raw speed" to "scientific throughput per unit of resource," which is a far more impactful metric for the target domain.
Weaknesses
The paper's strengths are considerable, but its tight focus on a specific algorithm raises broader questions about its long-term applicability and vision.
-
Dependence on the PaKman Algorithm: The entire NMP-PaK architecture, particularly the PE pipeline, is meticulously designed to accelerate the
MacroNode data structure and the specific operations of PaKman's Iterative Compaction. While PaKman is a strong current baseline, the field of bioinformatics algorithms evolves rapidly. The paper could be strengthened by discussing the generalizability of its architectural principles. Are there fundamental graph operations (e.g., node merging, edge redirection) that are common to other De Bruijn Graph or even Overlap-Layout-Consensus assemblers that NMP-PaK could be adapted to? Without this discussion, the work risks being perceived as an over-specialized point solution.
-
A Siloed View of the Assembly Pipeline: The work focuses almost exclusively on the Iterative Compaction step. However, the authors' own analysis (Figure 4, page 4) shows that k-mer counting still constitutes 25% of the original runtime. A truly comprehensive system would need to address this. While the authors briefly mention a potential GPU integration for this step (Section 4.6, page 9), it feels like an afterthought. A more holistic discussion of the system-level implications—such as the data movement between a potential GPU k-mer counter and the NMP compaction engine—would present a more complete and compelling vision.
-
Under-Explored Scientific Implications of the Co-Design: The connection between batch size, memory footprint, and the resulting contig quality (N50 score, Table 1, page 8) is one of the most fascinating aspects of this work. It represents a direct bridge between an architectural constraint and a scientific outcome. This trade-off is presented but not deeply explored. A discussion of why smaller batches degrade quality—perhaps due to the fragmentation of the global graph structure—could lead to deeper insights into the fundamental memory and data locality requirements of genome assembly, informing future architectural designs.
Questions to Address In Rebuttal
-
Could the authors elaborate on the generalizability of the NMP-PaK architecture? Beyond the PaKman assembler, what features of the PE design or the inter-PE communication fabric would be applicable to other graph-based problems in genomics, such as pangenome graph analysis or long-read assembly algorithms that also use graph structures?
-
Considering the entire assembly workflow, what are the authors' thoughts on the system-level integration of NMP-PaK? If k-mer counting were offloaded to a GPU, what would the data flow look like, and would the data transfer between the GPU's HBM and the NMP-enabled main memory become a new bottleneck?
-
The trade-off between memory footprint (driven by batch size) and scientific quality (N50) is a critical finding. Can the authors provide more insight into the algorithmic or biological reason for this relationship? Does this imply a fundamental "working set size" for genomic information that must be co-resident in memory for high-quality assembly, and what would this mean for the design of future memory systems for this domain?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents NMP-PaK, a hardware-software co-designed system for accelerating de novo genome assembly. The authors correctly identify that the state-of-the-art algorithm, PaKman, is bottlenecked by its "Iterative Compaction" step, which is memory-bound and exhibits irregular data access patterns. The core proposal is to accelerate this specific step using a channel-level Near-Memory Processing (NMP) architecture. The claimed novel contributions are the synthesis of (1) a custom 3-stage pipelined systolic Processing Element (PE) tailored to PaKman's "MacroNode" data structure, (2) a hybrid CPU-NMP software strategy to offload the processing of irregularly large data structures, and (3) a batch processing methodology to manage the large memory footprint.
Strengths
The primary strength of this work lies in the novel application and deep specialization of known architectural concepts to a specific, well-motivated, and previously unaccelerated software bottleneck.
-
Novel Specialization of an NMP Accelerator: While applying NMP to bioinformatics is not a new concept in itself (e.g., Zhou et al. [57] on Megahit, Angizi et al. [3] on smaller genomes), this work is the first I have seen to design an NMP architecture specifically for the Iterative Compaction phase of a de Bruijn graph assembler. The focus on the PaKman algorithm and its unique "MacroNode" data structure is highly specific.
-
Custom PE Microarchitecture: The proposed 3-stage systolic PE (Figure 9, Page 7) is a genuinely novel piece of microarchitecture. Its pipeline stages—(P1) Invalidation Check, (P2) TransferNode Extraction, and (P3) Routing and Update—are directly derived from the semantics of the PaKman compaction algorithm. This is not a generic graph processor retrofitted for the task; it is a purpose-built datapath whose novelty comes from its tight coupling to the algorithm's logic. (A stage-level skeleton, intended only as a reading aid, follows this list.)
-
Synthesis of Hybrid Processing for Irregularity: The hybrid CPU-NMP processing model is a clever solution to the classic problem of irregularity in data-parallel accelerators. While hybrid execution is a known pattern, its application here, motivated by the specific long-tail distribution of MacroNode sizes (Figure 6, Page 5), is a novel aspect of this system's co-design. It avoids the common pitfall of designing oversized PEs for worst-case scenarios, which is an important and practical contribution.
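As a reading aid for the stage structure credited above, the skeleton below mirrors only the stage names cited in this review. The function bodies, data layout, and routing target are placeholder assumptions; the actual MacroNode semantics and datapath are defined in the paper.

```python
# Skeleton of a 3-stage flow matching the stage names cited above
# (P1 invalidation check, P2 TransferNode extraction, P3 routing and update).
# Stage bodies are placeholder stubs, not the paper's datapath.

def p1_invalidation_check(node):
    return None if node.get("invalid") else node            # drop invalidated nodes

def p2_extract_transfernodes(node):
    return [{"dest": d, "payload": node["id"]} for d in node.get("edges", [])]

def p3_route_and_update(transfer, mesh):
    mesh.setdefault(transfer["dest"], []).append(transfer["payload"])

def run_pipeline(nodes):
    mesh = {}
    for node in nodes:                     # in hardware these stages overlap per cycle
        node = p1_invalidation_check(node)
        if node is None:
            continue
        for t in p2_extract_transfernodes(node):
            p3_route_and_update(t, mesh)
    return mesh

if __name__ == "__main__":
    sample = [{"id": 0, "edges": [1, 2]}, {"id": 1, "invalid": True}, {"id": 2, "edges": [0]}]
    print(run_pipeline(sample))            # {1: [0], 2: [0], 0: [2]}
```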
Weaknesses
The paper's claims of novelty could be sharpened by more clearly delineating between foundational concepts and their specific application. The core ideas are strong, but the surrounding narrative sometimes conflates standard engineering practice with novel contributions.
-
Constituent Concepts are Not Fundamentally New: The work builds upon several established concepts: channel-level NMP, systolic arrays for pipelined computation, hybrid CPU-accelerator models, and software batching to manage memory. The paper's novelty rests entirely on the synthesis and specialization of these ideas for this specific problem domain. The authors should be more precise in claiming this synthesis, rather than the individual components, as the core contribution.
-
Limited Novelty in Software Techniques: The batch processing approach (Section 4.4) is a standard technique for handling datasets that exceed main memory capacity. While its analysis in the context of contig quality (N50 metric, Table 1) is valuable, the technique itself lacks novelty. Similarly, the software refinements to PaKman described in Section 4.5 (e.g., using
__gnu_parallel::sort) are good software engineering practices for creating a stronger baseline, but they are not conceptually new contributions tied to the NMP architecture itself.
-
The Interconnect is Presented as a Solved Problem: The paper relies on an inter-PE crossbar and inter-DIMM network bridges to handle the highly irregular communication patterns (87.5% inter-DIMM). While citing prior work like DIMM-Link [58] is appropriate, the paper does not sufficiently argue whether this existing solution is optimal or if any novel adaptations were required for the specific traffic patterns generated by TransferNode routing. The interconnect is critical to the system's performance, but its design is treated as a secondary detail.
Questions to Address In Rebuttal
-
The novelty of the PE design is its specialization. Can the authors further elaborate on the delta between their 3-stage pipeline and pipelines in prior work on general-purpose NMP graph accelerators? Specifically, what operations within the "Invalidation Check" and "TransferNode Extraction" stages are fundamentally unique to the "MacroNode" data structure and cannot be efficiently mapped to existing graph processing primitives?
-
The hybrid CPU-NMP model uses a 1 KB threshold to offload large MacroNodes. Is this threshold an empirically derived magic number, or is it based on a more fundamental analytical model of PE buffer capacity, NMP latency, and CPU offload overhead? How sensitive is the system's overall performance to this specific value?
-
How tightly coupled is the NMP-PaK design to the PaKman algorithm? Could the proposed PE architecture be generalized to accelerate compaction steps in other de Bruijn graph assemblers (e.g., Megahit, IDBA-UD) that also perform graph simplification, or is the reliance on the "prefix/suffix" structure of MacroNodes a fundamental limitation? A discussion on the generality of the novel hardware would strengthen the paper.
Reconfigurable Stream Network Architecture
Abstract
As AI systems grow increasingly specialized and complex, managing hardware heterogeneity becomes a pressing challenge. How can we efficiently coordinate and synchronize heterogeneous hardware resources to achieve high utilization? How can we minimize the ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
This paper introduces the Reconfigurable Stream Network (RSN) Architecture, a novel ISA abstraction designed to manage heterogeneous hardware resources for DNN acceleration. The central concept is to model the datapath as a circuit-switched network of stateful functional units (FUs), where computation is programmed by triggering dataflow paths. The authors implement a proof-of-concept, RSN-XNN, on a VCK190 platform targeting Transformer models. They claim significant improvements in latency (6.1x) and throughput (2.4x-3.2x) over the state-of-the-art (SOTA) FPGA solution, and favorable performance and energy efficiency compared to contemporary GPUs like the T4 and A100.
Strengths
-
Well-Articulated Problem: The paper correctly identifies a critical challenge in modern heterogeneous systems: the friction and inefficiency in coordinating diverse compute resources (e.g., FPGA fabric and hardened AI Engines) and managing transitions between computational phases. The critique of coarse-grained, layer-level ISAs in current DNN overlays is valid and provides strong motivation.
-
Concrete Implementation: The authors go beyond simulation and implement their proposed architecture on a complex, real-world hardware platform (AMD Versal VCK190). This demonstration adds significant weight to their architectural concepts, as it forces them to confront practical implementation challenges.
-
Detailed Micro-optimizations: The discussion on fine-grained bandwidth mapping (Section 4.4, page 10) and the fusion of non-MM operators (Figure 11, page 10) demonstrates a deep understanding of performance optimization at the hardware level. These techniques are key contributors to the reported performance.
Weaknesses
My primary concerns revolve around the potential for exaggerated claims, the rigor of the comparative evaluation, and the unaddressed costs of the proposed abstraction.
-
Potentially Misleading and Overstated Performance Claims: The reported performance gains are exceptionally high and warrant intense scrutiny. In Figure 18, the authors highlight a "22x Latency" improvement over CHARM [119]. However, a careful reading of the text (Section 5.4, page 12) reveals this compares RSN-XNN at batch size 1 (5 ms) to CHARM's best latency at batch size 6 (110 ms). This is not a direct, apples-to-apples comparison. The more reasonable comparison at B=6 yields a 6.1x speedup, which, while still impressive, is significantly less than the highlighted 22x. Such presentation borders on misleading. (The arithmetic is restated after this list.)
-
Insufficiently Rigorous GPU Comparison: The comparison to GPUs (Section 5.6, Table 10) contains several methodological weaknesses.
- The claim of "2.1x higher energy efficiency" over the A100 GPU is based on a power figure for the VCK190 derived from the Vivado power estimation summary (Figure 15, page 11). Vivado's estimates are notoriously optimistic and context-dependent; they are not a substitute for physical, on-board power measurement under load. Comparing this estimate against the well-characterized TDP or measured dynamic power of a production GPU is not a sound methodology.
- While RSN-XNN's lower DRAM usage is a strong point, the performance comparison does not fully account for the vast difference in software maturity. NVIDIA's performance is achieved through a highly optimized software stack (CUDA, cuDNN), whereas RSN-XNN relies on a bespoke, model-specific compiler and instruction stream. The claim of "matching latency" with the T4 GPU is therefore made across fundamentally different programming ecosystems.
-
Use of a Strawman Baseline: The motivational comparison in Figure 6 (page 6) contrasts RSN with a "Baseline datapath" using a simplistic vector ISA. This baseline appears to lack standard optimizations like double buffering, which the authors themselves note can be explicitly added. It is unclear if this baseline is representative of a state-of-the-art von Neumann-style overlay, or a simplified model constructed to amplify the benefits of RSN. The argument would be stronger if compared against a more robust, publicly available overlay design.
-
Unquantified Architectural Overheads and Generalizability: The paper claims RSN is a flexible, general abstraction, but the RSN-XNN implementation appears highly specialized for Transformer encoders.
- The paper quantifies the area of the instruction decoder (Table 5a, page 11) but fails to analyze the more significant overhead of the flexible streaming interconnect itself. What is the cost in terms of routing congestion, wire utilization, and potential impact on clock frequency for creating this reconfigurable network in the PL? These are first-order concerns for any FPGA design.
- The datapath in Figure 10 (page 8) is purpose-built for the target workload. It is not demonstrated how this specific arrangement of FUs would be reconfigured or how efficiently it would perform on a different DNN architecture, such as a modern CNN or a GNN, which have different dataflow patterns. The claim of generalizability is thus asserted but not sufficiently proven.
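Restating weakness 1 as arithmetic: dividing CHARM's 110 ms at B=6 by RSN-XNN's 5 ms at B=1 yields the headline 22x, while the paper's own 6.1x figure at matched batch size implies an RSN-XNN latency of roughly 18 ms at B=6. The 18 ms value is this reviewer's inference from the stated ratio, not a number reported in the paper.

```python
# Arithmetic behind weakness 1. The 18 ms figure is inferred from the paper's
# stated 6.1x ratio at matched batch size, not reported directly.
charm_b6_ms = 110.0
rsn_b1_ms = 5.0
print(charm_b6_ms / rsn_b1_ms)        # 22.0 -> the headline, but batch sizes differ
rsn_b6_ms_inferred = charm_b6_ms / 6.1
print(round(rsn_b6_ms_inferred, 1))   # ~18.0 ms -> the apples-to-apples comparison
```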
Questions to Address In Rebuttal
-
Please justify the "22x Latency" claim presented in Figure 18. Specifically, address the discrepancy in batch sizes (B=1 vs. B=6) between your work and the CHARM baseline in that comparison. Why is this a fair or informative comparison to present?
-
Can you provide stronger evidence for the 2.1x energy efficiency claim over the A100 GPU? A justification for relying on Vivado power estimates over physical measurements is required. If physical measurements are not available, please provide a sensitivity analysis or acknowledge this as a significant limitation.
-
Please defend the "Baseline Datapath" used in Figure 6. How does this baseline compare to existing, well-regarded RISC-like overlay designs? Does it include common optimizations like software-managed double buffering that could mitigate the WAR hazard stalls you highlight?
-
Beyond the instruction decoder, what is the hardware cost (in terms of LUTs, FFs, and routing resources) of the reconfigurable streaming interconnect fabric in the PL? How did implementing this flexibility impact timing closure and the final achievable frequency (260 MHz)?
-
The RSN-XNN datapath is demonstrated on Transformer models. What specific architectural changes would be required to efficiently map a fundamentally different architecture, like ResNet-50 or a Graph Convolutional Network, onto the RSN-XNN hardware? Please comment on the required effort and the expected efficiency.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents the Reconfigurable Stream Network (RSN), a novel ISA-level abstraction for programming heterogeneous accelerators, with a specific focus on the domain of Deep Neural Networks (DNNs). The core idea is to move away from traditional coarse-grained, von Neumann-style overlays and instead model the hardware datapath as a circuit-switched network of stateful functional units (FUs). In this model, computation is programmed by establishing and triggering dataflow paths between FUs. This abstraction provides a unified framework for orchestrating spatially and temporally diverse hardware resources, such as the mix of programmable logic and hardened AI Engines (AIEs) found on modern platforms like the AMD Versal VCK190.
The authors implement a proof-of-concept system, RSN-XNN, targeting transformer models on the VCK190. Their evaluation demonstrates significant performance improvements over a state-of-the-art academic solution on the same platform (6.1x lower latency on BERT) and shows competitive energy efficiency compared to contemporary GPUs like the NVIDIA A100.
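To give a concrete flavor of the programming model summarized above, the sketch below is a purely hypothetical rendering of the "configure a path, then trigger it" style, intended to contrast with issuing one coarse layer-level instruction at a time. The class, method names, operands, and functional-unit identifiers are invented for illustration and do not correspond to the RSN ISA or RSNlib.

```python
# Purely hypothetical rendering of a "configure paths, then trigger them"
# programming style. Names, operands, and FU identifiers are invented.

class StreamNetwork:
    def __init__(self):
        self.paths = {}

    def configure_path(self, name, route):
        """Set up a circuit-switched route through stateful functional units."""
        self.paths[name] = route

    def trigger(self, name, tokens):
        """Push tokens through a configured path; FUs stream results downstream."""
        route = self.paths[name]
        return f"streamed {tokens} tokens along {' -> '.join(route)}"

if __name__ == "__main__":
    net = StreamNetwork()
    # One path keeps the matrix-multiply engine fed from DRAM; another drains
    # results through a post-processing unit, so the two can overlap in time.
    net.configure_path("load_qk", ["dram_rd", "buf_A", "aie_mm"])
    net.configure_path("attn_post", ["aie_mm", "softmax_fu", "buf_B"])
    print(net.trigger("load_qk", tokens=4096))
    print(net.trigger("attn_post", tokens=4096))
```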
Strengths
The primary strength of this paper is its central, elegant abstraction. It successfully reframes the problem of heterogeneous acceleration, offering a conceptually clean solution to a notoriously messy challenge.
-
A Unifying and Natural Abstraction: The RSN concept is an excellent fit for both the target application domain (DNNs) and the target hardware architecture (heterogeneous FPGAs). DNNs are fundamentally dataflow graphs, and RSN exposes this structure directly at the ISA level. This is a significant conceptual leap from prior overlays that attempt to impose a sequential, instruction-by-instruction execution model onto inherently spatial hardware. By abstracting hardware as a network of FUs, RSN provides a natural way to manage the "impedance mismatch" between different components, like the coarse-grained FPGA fabric and the fine-grained AIE array, as discussed in the introduction (Section 1, page 1).
-
Addressing a Critical and Timely Problem: With the rise of multi-die systems and heterogeneous integration (e.g., AMD's ACAPs, Intel's FPGAs with hardened blocks), the problem of efficient resource orchestration has become a primary bottleneck. This paper correctly identifies the limitations of existing approaches—namely, the high stalls associated with coarse-grained control and the difficulty of fine-grained coordination. RSN offers a compelling architectural paradigm to address this exact challenge.
-
Strong Empirical Validation: The authors go beyond a purely conceptual proposal and provide a robust implementation and evaluation. The RSN-XNN case study on a real and complex hardware platform (VCK190) is convincing. The detailed analysis of execution strategies, such as the dynamic pipelining of layers (Figure 7, page 7) and the fine-grained interleaving of memory operations (Figure 12, page 10), provides concrete evidence for the flexibility and power of the RSN model. The performance results are impressive and clearly demonstrate the value of enabling precise, software-controlled data movement and compute-communication overlap.
-
Excellent Contextualization within the Literature: The paper does a fine job of positioning its contribution within the broader landscape. It effectively connects to and distinguishes itself from several related fields:
- FPGA Overlays: It clearly articulates why its network-based ISA is superior to the prevailing VLIW-like or RISC-like overlays (Section 2.3, page 4).
- Dataflow Architectures and CGRAs: It acknowledges its roots in dataflow computing but correctly points out the difference in scale and heterogeneity of the FUs in the DNN domain (Section 2.5, page 4).
- ASIC Accelerators: It draws parallels to the deterministic, software-managed communication in ASICs like Groq and SambaNova, effectively arguing that RSN brings this class of dataflow flexibility to a reconfigurable substrate.
Weaknesses
While the core idea is strong, the paper's main weaknesses lie in the questions it leaves unanswered about the practical realization and generalization of the RSN vision.
-
The Compiler Challenge is Understated: The paper presents a Python-based library, RSNlib, for generating RSN instructions (Figure 13, page 10). This library appears to require significant expert knowledge to manually specify scheduling decisions, such as linking auxiliary operations and overlapping prolog/epilog phases. While this is acceptable for a proof-of-concept, the true promise of RSN hinges on a sophisticated compiler that can automate this complex spatial mapping and scheduling problem. The paper acknowledges this as future work, but the difficulty of this task is non-trivial and represents the single largest barrier to the broad adoption of this architecture. Without such a compiler, RSN remains a powerful tool for hardware experts, not a general-purpose programming model.
-
Generality of the Abstraction: The RSN model is developed and demonstrated in the context of DNNs, particularly transformers. This is a domain with highly regular, predictable, and streaming-friendly computation. It is less clear how the "triggered path" model would handle applications with more dynamic, data-dependent control flow or irregular memory access patterns. A deeper discussion on the architectural features required to extend RSN to other domains (e.g., graph analytics, scientific computing) would help to better define the boundaries of its applicability.
-
Potential Overheads of Flexibility: The work rightly celebrates the flexibility RSN enables. However, there is an implicit cost to this flexibility. For instance, making every FU a stallable, stream-aware network node with a standardized interface may introduce area, power, or timing overheads compared to a more specialized, tightly-coupled, fixed-function pipeline. While the instruction decoder overhead is shown to be small (Table 5, page 11), a more thorough analysis of the overhead within the FUs themselves would provide a more complete picture.
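For concreteness regarding the compiler concern raised in the first weakness, the following is a purely hypothetical sketch of the kind of expert-written spatial-mapping directives at issue. The stub class and method names (alloc_fu, stream_path, overlap) are invented for illustration and are not RSNlib's actual API; a future compiler would have to emit decisions like these automatically from a high-level graph (e.g., PyTorch or ONNX).

```python
# Hypothetical pseudo-API (not RSNlib): records the manual mapping and
# scheduling decisions an expert currently has to make by hand.

class StreamNetworkStub:
    def __init__(self):
        self.directives = []

    def alloc_fu(self, kind, count=1):
        # Expert decides how many FUs of each kind to instantiate.
        self.directives.append(("alloc", kind, count))
        return [f"{kind}_{i}" for i in range(count)]

    def stream_path(self, src, dst, tag):
        # Expert wires the circuit-switched stream paths between FUs.
        self.directives.append(("path", src, dst, tag))

    def overlap(self, epilog_of, prolog_of, resource):
        # Expert schedules cross-layer prolog/epilog overlap by hand.
        self.directives.append(("overlap", epilog_of, prolog_of, resource))

net = StreamNetworkStub()
gemm = net.alloc_fu("aie_gemm_tile", count=8)
smax = net.alloc_fu("fabric_softmax")
dma = net.alloc_fu("noc_dma", count=2)
net.stream_path(src=dma[0], dst=gemm[0], tag="weights")
net.stream_path(src=gemm[0], dst=smax[0], tag="scores")
net.overlap(epilog_of="attention", prolog_of="ffn", resource=dma[1])
print(len(net.directives), "hand-written directives for a single block")
```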
Questions to Address In Rebuttal
-
On the Path to Automation: Could the authors elaborate on the key challenges and their proposed direction for building a compiler that can automatically map high-level graph representations (e.g., from PyTorch or ONNX) to the RSN ISA? Specifically, how would such a compiler reason about resource allocation, path scheduling, and fine-grained memory interleaving to achieve the performance shown in the paper without manual directives?
-
On the Boundaries of the Model: What are the fundamental characteristics of a computational problem that make it well-suited for the RSN model? Conversely, could you provide an example of an application domain where RSN would be a poor fit, and explain what architectural modifications might be needed to support it?
-
On Designing for RSN: Could you comment on the microarchitectural implications of designing an "RSN-compliant" FU? What are the key requirements for the stream interfaces (e.g., regarding buffering, backpressure signaling), and how does this impact the design complexity of the FUs compared to traditional approaches?
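As a point of reference for the last question, the sketch below models a generic latency-insensitive stream port with bounded buffering and backpressure. This is a textbook valid/ready contract, not the paper's actual FU interface; it only illustrates the minimal handshake an "RSN-compliant" FU would have to expose.

```python
from collections import deque

class StreamPort:
    """Generic stream port with bounded buffering and backpressure (illustrative only)."""
    def __init__(self, depth=4):
        self.buf = deque()
        self.depth = depth

    @property
    def ready(self):           # seen by the upstream producer
        return len(self.buf) < self.depth

    @property
    def valid(self):           # seen by the downstream consumer
        return len(self.buf) > 0

    def push(self, flit):
        if not self.ready:
            raise RuntimeError("producer must stall: buffer full (backpressure)")
        self.buf.append(flit)

    def pop(self):
        if not self.valid:
            raise RuntimeError("consumer must stall: nothing to read")
        return self.buf.popleft()
```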
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper proposes the Reconfigurable Stream Network Architecture (RSN), an ISA abstraction designed to manage and orchestrate computation on modern heterogeneous platforms, specifically the AMD Versal ACAP which combines FPGA fabric and hardened AI Engines (AIEs). The authors claim novelty in this ISA abstraction, which models the datapath as a circuit-switched network of stateful functional units (FUs) programmed by triggering data paths. The core idea is to unify resource management and reduce phase-transition stalls through a stream-based, spatially-explicit programming model.
While the authors frame this as a novel architecture, the core conceptual underpinnings are deeply rooted in decades of research into dataflow computing, stream processing, and Coarse-Grained Reconfigurable Architectures (CGRAs). The execution model, which decouples control and data planes and uses latency-insensitive stream communication, is functionally equivalent to prior paradigms. The primary novelty of this work is not the architectural abstraction itself, but rather its specific and effective synthesis and application to a modern, highly heterogeneous commercial platform. The impressive results demonstrate excellent engineering, but the foundational ideas are not new.
Strengths
- Effective Synthesis of Known Concepts: The authors have successfully synthesized well-established principles from dataflow, streaming, and decoupled execution into a coherent and high-performing system for a modern, challenging hardware target. The engineering effort to make these ideas work on the Versal platform is significant.
- Strong Empirical Results: The performance improvements reported over the SOTA solution (CHARM [119]) on the same VCK190 platform are substantial (e.g., 6.1x latency reduction for BERT). This demonstrates that the authors' implementation of the chosen paradigm is highly effective and justifies the complexity.
- Well-Articulated Problem: The paper correctly identifies a critical and timely challenge in modern computer architecture: the difficulty of orchestrating diverse compute resources (FPGA logic vs. hardened processor arrays) and minimizing the overhead of transitioning between computational phases.
Weaknesses
- Overstated Conceptual Novelty: The central weakness of this paper is the overstatement of its conceptual novelty. The "Reconfigurable Stream Network" is a new name for a very old idea. The concept of modeling computation as a network of processing nodes communicating via streams is the foundation of dataflow architectures and stream processors.
- Insufficient Differentiation from Prior Art: The paper's own "Related Work" section (Section 6, page 14) contains the evidence against its primary claim of novelty.
- Stream Processors [30, 52, 53] & Streaming Dataflow [73, 87]: These works established the paradigm of using kernels operating on data streams. RSN's programming model is a direct descendant. The authors do not sufficiently explain how their abstraction is fundamentally different from these.
- DySER [36]: This work is described as integrating a "circuit-switched network of stateless FUs into the execution stage of a processor pipeline." RSN is described as a "circuit-switched network with stateful functional units" (Abstract, page 1). The delta between "stateless" and "stateful" FUs is incremental, as many dataflow nodes are inherently stateful (e.g., an accumulator). This is not a fundamental architectural innovation.
- Triggered Instructions [82]: This paradigm "removes the program counter and integrates the architectural state registers into the FUs." RSN's execution model, where FUs react to incoming uOPs and stream data, is conceptually identical to a triggered instruction model where nodes are activated by the arrival of data or control tokens. The paper fails to articulate a clear distinction.
- Mischaracterization of CGRA Limitations: The paper attempts to differentiate RSN from CGRAs by claiming that CGRAs have "limited support for coarse-grained heterogeneity" (Section 2.5, page 5). This is a generalization that ignores a significant body of research on heterogeneous CGRAs (e.g., [11, 14, 35]). The core principle of virtualizing heterogeneous FUs is common to both fields. The novelty here is the scale of heterogeneity (AIE vs. fabric), not the principle of abstraction.
The contribution would be more accurately described as a specific, high-performance ISA design and implementation of a dataflow-style overlay for the Versal ACAP, rather than a novel architectural paradigm.
Questions to Address In Rebuttal
- Please clarify the fundamental conceptual difference between the RSN execution model and the one proposed in DySER [36], which also uses a circuit-switched network of FUs. What makes the addition of state to the FUs a novel architectural contribution rather than an implementation detail required by the target algorithms?
- The "Triggered Instructions" paradigm [82] proposes a control mechanism where FUs are activated by data/token arrival in a spatially-programmed architecture. How does the RSN's control plane, which issues uOPs to FUs to initiate kernel execution on streams, fundamentally differ from this established concept?
- The paper claims that a "network abstraction at the ISA level" is the authors' key insight (Abstract, page 1). Seminal stream processors like Imagine [52] already exposed a stream-based programming model and ISA. What specific aspect of the RSN's "network abstraction" is novel compared to the stream abstractions in this prior art?
- Given the extensive and closely related prior work, could the authors re-characterize their primary novel contribution? Is the contribution the architectural abstraction itself, or is it the specific, co-designed ISA, multi-level decoder (Section 3.3, page 7), and FU library that makes this known abstraction class highly efficient for DNNs on the novel Versal platform?
DS-TPU: Dynamical System for on-Device Lifelong Graph Learning with Nonlinear Node Interaction
Abstract
Graph learning on dynamical systems has recently surfaced as an emerging research domain. By leveraging a novel electronic Dynamical System (DS), various graph learning challenges have been effectively tackled through a rapid, spontaneous natural ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose DS-TPU, a dynamical-system-based analog hardware accelerator for graph learning. The work introduces two main contributions: 1) an on-device training mechanism, termed "lifelong learning," which uses a feedback electric current as a physical analog for the loss function (EC-loss), and 2) a method for modeling nonlinear node interactions using Chebyshev polynomials. The authors claim that this algorithm-architecture co-design results in orders-of-magnitude improvements in training and inference speed and energy efficiency over SOTA GNNs and hardware accelerators, alongside superior accuracy.
While the paper presents a conceptually novel approach to unifying training and inference in an analog substrate, its claims of extreme performance and accuracy improvements rest on a series of questionable methodological choices and inadequately justified assumptions. The evaluation framework, particularly concerning baseline comparisons and the nature of the simulation, lacks the rigor necessary to substantiate the headline claims.
Strengths
- Novel On-Device Training Mechanism: The core concept of formulating the loss function as a physical, feedback electric current (EC-loss, Section 3.2.1, page 4) is an elegant and physically-grounded idea. It provides a clear mechanism for on-device parameter updates without conventional digital backpropagation.
- Robustness to Hardware Non-Idealities: The demonstration of robustness to parameter mismatch (Figure 13, page 10) is a significant strength. The ability for the on-device learning to self-correct for analog hardware variations is a compelling advantage of this paradigm.
- Principled Introduction of Nonlinearity: The use of Chebyshev polynomials to introduce nonlinearity is well-justified by their bounded nature, which is a critical consideration for implementation in a physical, voltage-constrained system (Section 3.3.2, page 6).
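For reference, the boundedness that makes this choice hardware-friendly is a standard property of Chebyshev polynomials of the first kind (a textbook fact restated here, not a result taken from the paper):

```latex
T_n(x) = \cos\!\big(n \arccos x\big) \;\Rightarrow\; |T_n(x)| \le 1 \ \text{for } |x| \le 1,
\qquad
T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x), \quad T_0(x) = 1,\ T_1(x) = x.
```

Every basis term the analog circuitry must generate therefore stays within a fixed, voltage-friendly range.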
Weaknesses
- Fundamentally Flawed Baseline for Accelerator Comparison: The performance comparison against SOTA GNN accelerators (Table 3, page 10) is built on the indefensible assumption that these accelerators "achieve 100% utilization on any graph" (Section 4.1, page 7). This is a critically flawed premise. Real-world accelerator performance is heavily dictated by memory bandwidth limitations, graph sparsity, and dataflow inefficiencies, making 100% utilization a theoretical ceiling that is never achieved in practice. This assumption artificially inflates the reported speedups (e.g., 115x) and invalidates the central performance claims against prior hardware work.
- Lack of Rigor in Key Algorithmic Design Choices: The decision to fix the parameters h_i and only train J_ij is weakly justified. The authors state that "empirically, trainable h_i do not lead to better results" (Section 3.2.1, page 5) without providing any supporting data. This is a critical design choice that significantly alters the optimization landscape. An unsubstantiated empirical claim is not a substitute for a rigorous ablation study. The decision appears to be one of convenience to avoid instability, which points to a potential weakness in the model's formulation itself.
- Unvalidated Simulation Environment: The results for DS-TPU are derived from a "CUDA-based Finite Element Analysis (FEA) software simulator" (Section 4.1, page 7). The paper provides no information on the validation of this simulator against a physical hardware prototype (e.g., the BRIM framework it is supposedly based on). Without such validation, it is impossible to assess the accuracy of the reported latency and power figures, which may not account for real-world analog effects like parasitic capacitances, process variation beyond simple mismatch, or thermal noise dynamics.
- Understated Costs and Misleading Aggregation of Results: The paper heavily emphasizes performance benefits while downplaying the substantial hardware costs. As shown in Table 2 (page 8), moving from a linear model (DS-TPU-Linear) to the highest-performing 3rd-order model (DS-TPU-3rd) incurs a 4x area increase (8.5 mm² to 34.1 mm²) and a 4.4x increase in max power (1.3 W to 5.7 W). Furthermore, the headline claim of a "10.8% MAE reduction" is an average that masks highly variable performance. For instance, on the PEMS04-flow dataset, the improvement from DS-GL to DS-TPU-3rd is negligible (17.07 to 17.04, Table 1), suggesting the immense hardware overhead offers no practical benefit in some cases. A cost-benefit analysis is conspicuously absent.
- Overstated "Lifelong Learning" Claim: The term "lifelong learning" implies continuous adaptation to non-stationary data distributions. The experiments presented are on static, pre-partitioned datasets (70/20/10 split). This is a standard supervised training setup, not a demonstration of lifelong learning. The work demonstrates on-device training, which is valuable, but using the term "lifelong learning" is a mischaracterization of the experimental validation.
Questions to Address In Rebuttal
- Please provide a detailed justification for the "100% utilization" assumption for baseline hardware accelerators. Alternatively, provide a revised comparison using more realistic performance models for these accelerators that account for known bottlenecks like memory access and dataflow dependencies.
- What steps were taken to validate the FEA simulator against physical hardware measurements? Please provide data on the simulator's accuracy in modeling latency, power, and non-ideal analog circuit behaviors.
- Please provide the empirical results and analysis that support the claim that training the h_i parameters does not lead to better results. This is a central design decision that requires concrete evidence.
- The hardware cost for nonlinearity is substantial (Table 2). Can the authors provide a more nuanced analysis of the accuracy-area-power trade-off? For which specific applications is a 4x increase in area justified by the marginal accuracy gains observed?
- Please clarify how the conducted experiments support the claim of "lifelong learning." Were the models tested on evolving data streams or in a continual learning context? If not, the authors should justify the use of this term over the more accurate "on-device training."
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents DS-TPU, a novel architecture and learning algorithm for dynamical system (DS)-based graph learning. The work tackles two fundamental limitations of prior DS accelerators like DS-GL: their inability to perform on-device training and their restriction to linear node interactions. The authors' core contribution is a tightly integrated algorithm-architecture co-design that introduces a physically-grounded, on-device learning mechanism they term "Electric Current Loss" (EC-Loss). This mechanism cleverly uses feedback currents within the analog circuit to represent the loss function, enabling continuous, "electron-speed" model refinement. To enhance model expressivity, the work incorporates nonlinear node interactions modeled by Chebyshev polynomials, a choice well-suited for the physical constraints of the hardware. The result is a system that promises orders-of-magnitude improvements in training speed and energy efficiency over conventional GPU-based methods, while also achieving higher accuracy than state-of-the-art Graph Neural Networks (GNNs).
Strengths
-
A Tangible Step Towards "Mortal Computation": The most significant contribution of this work is its elegant fusion of the learning rule with the physical hardware. The authors' framing of this within the context of Hinton's "forward-forward algorithm" and "mortal computation" (Section 2, page 2) is highly insightful. While not a direct implementation of Hinton's specific algorithm, the EC-Loss mechanism embodies its core philosophy: unifying inference and training on the same substrate to create a more efficient and biologically-plausible learning system. By mapping the abstract concept of a loss gradient to a measurable electric current, the paper provides a compelling blueprint for how such advanced computing paradigms can be realized.
-
Bridging the Analog Inference-Digital Training Gap: The field of analog and DS-based accelerators has long been hampered by a critical disconnect: ultra-fast, low-power inference is coupled with slow, power-hungry offline training on conventional digital hardware. This paper directly attacks this bottleneck. The proposed lifelong learning mechanism is not just an add-on; it is a fundamental re-imagining of how learning can occur in such systems. This has the potential to unlock the true promise of DS computing for edge applications where continuous adaptation to new data is essential.
-
Pragmatic Co-design for Enhanced Expressivity: The introduction of nonlinearity via Chebyshev polynomials is an excellent example of algorithm-hardware co-design. Instead of proposing a mathematically ideal but physically unrealizable function, the authors chose a class of functions (polynomials) that are not only powerful approximators but are also bounded and can be constructed from simpler monomial terms generated by analog circuits (Section 3.3, page 6; Figure 7, page 7). This demonstrates a deep understanding of both the theoretical requirements of machine learning models and the practical constraints of physical hardware.
-
Exceptional System-Level Performance: The claimed performance gains are truly staggering (Section 4, pages 7-10). The 810x speedup in training over an A100 GPU and 115x speedup in inference over SOTA GNN accelerators, if reproducible, would represent a major breakthrough. Even if these results represent a best-case scenario, the orders-of-magnitude difference highlights the profound potential of shifting from conventional digital paradigms to physics-based computing for this class of problems.
Weaknesses
My criticisms are less about flaws in the work and more about probing the boundaries of its contributions and understanding its place in the broader landscape.
-
Scalability and the Specter of N² Complexity: Like many fully-connected architectures, the DS-TPU faces an inherent O(N²) scaling challenge in its coupling units for N nodes. The authors briefly mention "sparse scaling" as a solution (Section 4.3, page 9), suggesting PEs are used to process graph partitions. This is a critical point that is underdeveloped. For the massive, sparse graphs common in the real world, the practical implementation and performance trade-offs of this sparse, partitioned approach will determine the true applicability of the architecture. A more detailed discussion of this is warranted.
-
Generalizability of the EC-Loss Principle: The paper derives the EC-Loss as an analog to MAE and MSE loss functions for regression-style graph prediction tasks. This is a fantastic result, but it raises the question of the principle's generality. Can this physical feedback mechanism be adapted for other crucial learning tasks, such as node classification, which would typically require a cross-entropy loss? Understanding the scope of problems for which a physical loss analog can be constructed is key to assessing the long-term impact of this technique.
-
Contextualization with Broader Analog AI: While the paper does an excellent job comparing itself to digital GNNs and accelerators, it exists within a wider context of analog AI and neuromorphic hardware that also promises in-situ training. For example, memristor crossbar arrays are also being explored for accelerating GNNs with in-memory computation and analog gradient descent. A brief discussion situating DS-TPU's unique dynamical-system-based approach relative to these other non-digital paradigms would strengthen the paper's contribution to the emerging hardware community.
Questions to Address In Rebuttal
-
Could the authors elaborate on the "sparse scaling" approach mentioned in Section 4.3 (page 9)? Specifically, how is the partitioning handled, what is the communication overhead between PEs, and how does this affect the "spontaneous" nature of the system's evolution towards a solution?
-
The EC-Loss mechanism is elegantly derived for regression losses (MAE/MSE). What are the authors' thoughts on the feasibility of extending this physical learning principle to classification tasks? Would this require a fundamental change to the node dynamics or the Hamiltonian, or could a clever circuit-level analog for a different loss function be designed?
-
From a broader perspective, what do the authors see as the primary advantages of this dynamical system approach for on-device learning compared to other prominent analog AI paradigms like memristor-based in-memory computing?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents DS-TPU, a dynamical system (DS) based accelerator for graph learning that claims two primary novel contributions. First, it introduces an on-device, lifelong learning mechanism by formulating a loss function as a physical "Electric Current Loss" (EC-Loss), which enables model parameter updates via a hardware feedback loop. Second, it incorporates a mechanism for modeling nonlinear node interactions by constructing them from Chebyshev polynomial expansions, a feature absent in prior linear DS-based accelerators. These two features are presented as a tightly coupled algorithm-architecture co-design intended to solve the training-inference performance gap and the limited expressiveness of previous works like DS-GL.
My analysis concludes that while the high-level concepts have precedents in adjacent fields, the specific formalization and hardware co-design for this class of accelerator are genuinely novel. The core innovation lies in the specific physical instantiation of an on-device learning rule (I_loss) and a hardware-friendly nonlinear interaction mechanism within a DS accelerator framework, which collectively represent a significant step forward from the prior art.
Strengths
The paper's primary strength is the successful and elegant mapping of abstract learning concepts onto physical, analog circuit behavior.
-
Novel On-Device Learning Formalism: The formulation of the loss function as a feedback electric current (I_loss = I_in - I_R, as discussed in Section 3.2.1, page 4) is a genuinely novel contribution for this class of hardware. While the concept of in-situ or on-device training exists in domains like neuromorphic computing (e.g., using memristive crossbars), its direct translation to an Ising-like dynamical system, where the loss is a measurable current that directly drives the update of coupling parameters (conductances), appears to be new. This moves beyond simply running a known algorithm on new hardware; it derives a new learning rule from the physics of the hardware itself. A worked sketch of this mapping is given after this list.
-
Hardware-Aware Nonlinearity: The standard Ising model is inherently linear in its pairwise spin interactions. While higher-order interaction models exist in statistical physics, they are notoriously difficult to map to scalable hardware. The authors' approach of using a series expansion to approximate arbitrary nonlinear functions is a known mathematical technique, but the novelty lies in the co-design. Specifically, the selection of Chebyshev polynomials for their bounded-value property (Section 3.3.2, page 6) is a clever, hardware-aware choice that respects the physical voltage limits of an electronic system. The architectural integration of a "Nonlinearity Generator" (Figure 7, page 7) and its coupling with the novel EC-Loss training mechanism to learn the polynomial coefficients (J^m_ij) is a non-trivial and novel system-level contribution.
-
Demonstration of a Full Loop: The work does not merely propose these ideas in isolation. It demonstrates a complete, closed-loop system where the nonlinear interactions are learnable on the device using the proposed physical feedback mechanism. This synergy between the two core claims is the paper's strongest element of novelty, distinguishing it from works that might propose one feature but not the other.
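To make the analogy in the first strength concrete, here is a minimal worked sketch. The Ohmic feedback conductance G_fb and the target node voltage v_i* are assumptions introduced purely for illustration; they are not notation taken from the paper.

```latex
I_{\text{loss}} = I_{\text{in}} - I_R, \qquad
I_{\text{in}} = G_{\text{fb}}\, v_i^{*}, \quad I_R = G_{\text{fb}}\, v_i
\;\Rightarrow\;
I_{\text{loss}} = G_{\text{fb}}\,(v_i^{*} - v_i), \qquad
|I_{\text{loss}}| \;\propto\; |v_i^{*} - v_i|.
```

Under this reading, the measurable feedback current is proportional to the per-node regression error, which is what would let it drive in-place updates of the coupling conductances J_ij without a digital gradient computation.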
Weaknesses
The weaknesses of the paper relate primarily to the contextualization of its novelty with respect to broader, established fields.
-
Limited Acknowledgment of Neuromorphic Precedent: The concept of leveraging device physics for local, on-device learning is a cornerstone of neuromorphic engineering. The paper mentions Hinton's "forward-forward algorithm" as an inspiration but fails to connect its EC-Loss to the rich history of hardware-based Hebbian learning, Spike-Timing-Dependent Plasticity (STDP), or contrastive Hebbian learning, where physical quantities (e.g., charge, flux) are often used implicitly or explicitly to update synaptic weights in-situ. While the specific formalism for a DS accelerator is new, a discussion of how this work fits into the broader landscape of physical learning systems would strengthen its claimed novelty by clarifying the precise delta.
-
Function Approximation is Not Fundamentally New: The use of polynomial series to approximate functions is a classical technique. The novelty here is not the mathematics, but the specific engineering choice and hardware implementation. The paper could be strengthened by more explicitly stating this, and perhaps by justifying the choice of Chebyshev polynomials over other potential hardware-friendly basis functions (e.g., Legendre polynomials, or even a simple truncated power series) beyond a brief mention of their bounded nature. Is this choice fundamentally optimal, or merely a convenient and sufficient one? The paper does not provide this deeper analysis.
Questions to Address In Rebuttal
-
Could the authors elaborate on the relationship between the proposed "Electric Current Loss" mechanism and established on-device learning rules from the neuromorphic computing literature? Specifically, how does this feedback mechanism differ conceptually from, for instance, delta-rule implementations on memristor crossbars where voltage differences are used to induce resistance changes?
-
The selection of Chebyshev polynomials is justified by their bounded output. Were other basis functions considered for the series expansion? A simple power series (a_0 + a_1σ + a_2σ^2 + ...) might seem more straightforward. What are the specific hardware trade-offs (e.g., circuit complexity, stability, area) that make the Chebyshev basis superior for this particular DS-TPU architecture?
-
The cost of the proposed novel features is significant, as shown in Table 2 (page 8), where the area and max power of DS-TPU-3rd are ~5x and ~10x that of the baseline DS-GL, respectively. How does the novelty scale? As the order of the polynomial M or the number of nodes N increases, does the overhead of the CFMs and NGs threaten the viability of the approach, or are there architectural innovations that mitigate this cost explosion?
TRACI: Network Acceleration of Input-Dynamic Communication for Large-Scale Deep Learning Recommendation Model
Abstract
Large- scale deep learning recommendation models (DLRMs) rely on embedding layers with terabyte-scale embedding tables, which present significant challenges to memory capacity. In addition, these embedding layers exhibit sparse and random data access ...
Reviews
Review 1
Review Form:
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper proposes TRACI, a hardware-software co-design to accelerate the Aggregation operator in large-scale Deep Learning Recommendation Models (DLRMs). The authors correctly identify that communication in the embedding layers is a significant bottleneck. Their core idea is to exploit input reuse and output reuse opportunities inherent in the Aggregation pattern. To this end, they introduce a new memory primitive, GetReduce, and a modified network switch architecture featuring an in-switch cache (ISC) for input reuse and a reduction table (RTB) for output reuse. The evaluation, performed via cycle-accurate simulation, claims significant speedups for the Aggregation operator and notable end-to-end application speedups.
Strengths
- Problem Motivation: The paper provides a clear and well-motivated analysis of the communication bottleneck in multi-GPU DLRM systems. Figure 3 effectively illustrates how communication dominates execution time, particularly as the compute-to-network bandwidth ratio increases.
- Core Insight: The identification and quantification of input reuse and output reuse as distinct optimization opportunities (Section 2.3, Table 1) is a valid and useful insight. This forms a solid foundation for the proposed solution.
- Architectural Concept: The proposal to expose reuse semantics to the network via a new primitive (GetReduce) is a logical approach. It correctly decouples the expression of the communication pattern from the underlying point-to-point transport, enabling network-level optimizations.
Weaknesses
My primary concerns with this work center on the practicality of the hardware implementation, the robustness of the evaluation methodology, and potential overstatement of the design's scalability.
- Simplistic Cache Coherence Model: The proposed solution for cache coherence in the in-switch cache (ISC) is to "invalidate all cache blocks whenever a multi-GPU synchronization happens" (Section 5.3.2, Page 8). This is a brute-force approach that raises serious questions. While the authors claim this incurs "essentially no performance overhead" in training due to existing batch-level synchronizations, this assumption is fragile. It ignores the performance cost of a global cache flush and may not be applicable to latency-sensitive inference scenarios that lack clear, infrequent synchronization points. The paper provides no data to substantiate the claim that this mechanism is truly low-overhead.
- Demonstrated Scalability Limitations: The paper's own results cast doubt on the scalability of the reduction mechanism. In Figure 15 (Page 12), the performance benefit from the "Reduction-only" configuration peaks at 64 GPUs and then declines for both the fbgemm_0 and DBLP datasets. The authors themselves acknowledge in Section 6.4.2 that "at 128 GPU and 256 GPU cases, the reduction table size is not large enough and some packets are bypassed". This is not a peripheral issue; it is a fundamental limitation of the proposed fixed-size RTB design. A solution that does not scale with the system size it is designed for is critically flawed.
- Partitioned and Extrapolated Evaluation: The evaluation methodology is suspect. The communication performance of the Aggregation kernel is measured in isolation using a gem5-based simulator, while the performance of the rest of the model is estimated using Astra-sim. The end-to-end speedup is then derived by combining these two results (Section 6.5, Page 11). This partitioned approach fails to capture complex interactions between the communication fabric and the rest of the system. For instance, accelerated communication could alter memory access patterns or contention for on-chip resources in ways not captured by this disconnected simulation, making the extrapolated end-to-end results unreliable.
- Understated Hardware Complexity and Overheads: The hardware overhead analysis in Section 6.6 is superficial. It relies on Cacti to estimate SRAM area for the cache and reduction table, claiming a mere 2.82% total area increase for the switch. This completely ignores the area, power, and timing complexity of the additional control logic, the modified flit processing pipeline (Figure 9), the content-addressable lookups for the cache/RTB, and the arithmetic units required for reduction. The proposed pipeline introduces new states and decision points that will substantially complicate the switch design beyond just the memory arrays. Power consumption, a critical factor in datacenter hardware, is not discussed at all.
- Insufficient Deadlock Analysis: The deadlock prevention strategy for the RTB (Section 5.2.2, Page 7) is described heuristically. The paper proposes stalling newly injected messages while bypassing in-transit messages if the RTB is full. While this sounds plausible, network design requires rigorous proof of deadlock freedom. Without a more formal analysis of channel dependencies and resource allocation cycles, the claim that this design is deadlock-free is unsubstantiated.
Questions to Address In Rebuttal
The authors must address the following points to establish the validity and practicality of their work:
- Please provide a quantitative analysis of the performance impact of your proposed global cache flush mechanism. How much time does the invalidation take across 256 GPUs? How does performance degrade if synchronization points become more frequent than once per batch?
- The reduction table (RTB) performance clearly degrades at larger system scales as shown in your own Figure 15. Please address this fundamental scalability issue. Is the proposed design only viable for systems up to 64 GPUs, or do you have a credible proposal for making the RTB mechanism scale to hundreds or thousands of nodes?
- Justify your partitioned evaluation methodology. What evidence can you provide that separating the simulation of network communication from the rest of the model execution does not hide critical second-order effects that would invalidate your extrapolated end-to-end speedup claims?
- Regarding the GetReduce primitive, what are the specific implementation challenges and performance implications of adding support for this transaction to a real-world GPU's memory controller and network interface? The paper glosses over the host-side complexity.
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents TRACI, a novel in-network acceleration architecture designed to address the communication bottleneck in large-scale Deep Learning Recommendation Models (DLRMs). The authors correctly identify that the Aggregation operator, central to DLRM embedding layers, is dominated by cross-GPU communication that is both voluminous and input-dynamic, making it a poor fit for existing acceleration techniques that target static patterns like All-Reduce.
The core contribution is a holistic hardware-software co-design. At the software level, they introduce a new memory-semantic primitive, GetReduce, which augments a standard memory read request with a reduction target address. This single change provides the network with the semantic context needed to identify optimization opportunities. At the hardware level, they propose a modified switch architecture featuring an in-switch cache to exploit input reuse (multicasting) and a reduction table to exploit output reuse (early reduction). By tackling this input-dynamic pattern directly within the network, TRACI achieves significant speedups (averaging 3.12x for a 64-GPU system) on the critical Aggregation operation.
Strengths
The true strength of this paper lies in its synthesis of several important research threads into a cohesive and compelling solution for a critical, real-world problem.
-
A Principled Leap Beyond Static Collectives: The community has invested significant effort in accelerating static communication patterns like All-Reduce. This paper makes the insightful observation that the next major frontier is dynamic, data-dependent communication. By choosing the DLRM Aggregation operator as their target, the authors ground their work in a problem of immense industrial relevance. The comparison in Figure 1 (Page 2) effectively frames this as a fundamentally different and harder challenge.
-
Elegant Hardware-Software Co-design: The GetReduce primitive is the lynchpin of the entire design. It is a simple yet powerful abstraction that exposes the application's intent (read from address I and add to address O) to the network. This allows the hardware (the in-switch cache and reduction table) to opportunistically and dynamically discover and exploit reuse without complex software-level scheduling. This is an excellent example of where a small change to the ISA/interface can unlock massive hardware optimization potential.
-
Holistic Optimization: Many prior works have focused on exploiting either input reuse (e.g., through caching) or output reuse (e.g., through near-memory or in-GPU reduction), as the authors correctly survey in Section 2.4 (Page 3). The key insight here is that for the Aggregation pattern, these two opportunities coexist and that the network is the ideal place to exploit both simultaneously. The ablation studies presented in Figure 10 (Page 9) clearly demonstrate that both the cache and the reduction table contribute significantly and that their combination is necessary to achieve the best performance across different datasets and system scales.
-
Contextual Positioning: The paper does a good job of positioning itself relative to adjacent fields. It connects to the rich history of in-network computing (from database query offload to HPC collectives) and in-network caching (from key-value stores), while clearly articulating how the unique constraints of GPU memory-semantic fabrics and DLRM workloads necessitate a new approach.
Weaknesses
The weaknesses of the paper are primarily related to the potential complexities and unexplored boundaries of an otherwise strong core idea.
-
Implementation Complexity and Overhead: The proposed flit processing pipeline in Figure 9 (Page 8) is considerably more complex than a standard network switch pipeline. The interactions between the reduction table (RTB), in-switch cache (ISC), and the main crossbar, including stall/bypass logic for deadlock avoidance, introduce significant design complexity. While the area overhead is analyzed (Table 4, Page 11) and appears modest, the potential impact on clock frequency, power consumption, and the latency of non-TRACI traffic is not discussed.
-
Pragmatic but Limited Coherence Model: The proposed coherence solution—invalidating the entire network cache on every training batch synchronization (Section 5.3.2, Page 8)—is a clever and pragmatic choice for the DLRM training use case. However, it implicitly tethers the architecture to workloads with clear, coarse-grained synchronization phases. This may limit the applicability of TRACI to other potential applications, such as real-time inference serving or more general-purpose graph processing, which may lack such clean synchronization boundaries.
-
Narrow Application Focus: The paper is laser-focused on the DLRM Aggregation operator. While this is a strength in terms of problem motivation, it leaves the reader wondering about the broader applicability of the GetReduce primitive and the TRACI architecture. Are there other significant communication patterns in domains like Graph Neural Networks, sparse linear algebra, or physics simulations that could benefit from a similar approach? A brief discussion of this would greatly strengthen the paper's perceived long-term impact.
Questions to Address In Rebuttal
-
Hardware Cost and Non-Blocking Performance: The area overhead analysis is helpful, but can you provide any estimates or qualitative discussion on the power and latency overhead of the TRACI switch components? Specifically, does the added logic in the pipeline increase the latency for standard Get or Put operations that do not use the caching or reduction features?
Robustness of Deadlock Avoidance: Your deadlock avoidance strategy (Section 5.2.2, Page 7) of stalling newly injected requests while allowing in-flight requests to bypass a full reduction table is interesting. Can you provide a more formal argument or stronger intuition as to why this is sufficient to prevent deadlock cycles, especially in very large and potentially congested networks?
-
Generalizability of the GetReduce Primitive: Could you speculate on the potential of the GetReduce primitive and the TRACI architecture beyond DLRMs? For example, could it be used to accelerate the "gather" and "aggregate" phases in Graph Neural Networks, which also exhibit input-dependent, many-to-one communication patterns? Answering this would help contextualize the work as a more general-purpose building block for future accelerators.
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper introduces TRACI, a hardware-software co-designed architecture to accelerate the Aggregation operator in large-scale Deep Learning Recommendation Models (DLRMs). The authors identify that Aggregation is bottlenecked by cross-GPU communication and that its communication pattern is input-dynamic, precluding the use of existing acceleration techniques designed for static collectives like All-Reduce.
The authors' core claim of novelty rests on a three-part proposal:
1. A new memory-semantic primitive, GetReduce, which bundles an input address (IAddr) and an output/reduction address (OAddr) into a single transaction.
2. An in-switch cache that uses the IAddr from GetReduce requests to serve repeated reads for input data, thus exploiting input reuse.
3. An in-network reduction table that uses the OAddr to aggregate data from multiple responses on-the-fly, thus exploiting output reuse.
The central thesis is that the GetReduce primitive provides the network with the necessary semantic information to simultaneously and dynamically exploit both input and output reuse, a combination not addressed by prior art for this specific problem.
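As a reading aid, the sketch below is a toy functional model of the three-part proposal summarized above. The class names, field names, and the fixed "expected responses" count are assumptions made for illustration; they are not the paper's actual packet format, counter mechanism, or switch microarchitecture (flow control, deadlock avoidance, and coherence are omitted entirely).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GetReduce:
    """Toy GetReduce transaction: read IAddr, reduce the result into OAddr."""
    iaddr: int   # input (embedding-row) address: key for input reuse
    oaddr: int   # reduction-target address: key for output reuse
    src: int     # requesting GPU id

class ToySwitch:
    """Functional sketch of the in-switch cache (ISC) + reduction table (RTB) idea."""
    def __init__(self):
        self.isc = {}   # iaddr -> cached row data
        self.rtb = {}   # oaddr -> (partial_sum, responses_seen)

    def on_request(self, req, fetch):
        # Input reuse: repeated reads of the same IAddr are served from the
        # switch instead of generating another memory access.
        if req.iaddr not in self.isc:
            self.isc[req.iaddr] = fetch(req.iaddr)
        return self.isc[req.iaddr]

    def on_response(self, req, data, expected):
        # Output reuse: responses destined for the same OAddr are combined
        # in-network; only the final partial sum leaves the switch.
        total, seen = self.rtb.get(req.oaddr, (0.0, 0))
        total, seen = total + data, seen + 1
        if seen == expected:
            self.rtb.pop(req.oaddr, None)
            return total            # reduced value forwarded toward req.src
        self.rtb[req.oaddr] = (total, seen)
        return None                 # held in the table, awaiting more contributions

# Example: three lookups (two hitting the same row) reducing into one output slot.
switch = ToySwitch()
reqs = [GetReduce(iaddr=a, oaddr=0x100, src=0) for a in (7, 7, 9)]
rows = {7: 1.5, 9: 2.0}
result = None
for r in reqs:
    value = switch.on_request(r, fetch=rows.__getitem__)
    result = switch.on_response(r, value, expected=len(reqs))
print(result)   # 5.0 == 1.5 + 1.5 + 2.0
```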
Strengths
The primary strength of this work lies in its novel synthesis of existing concepts to solve a new, well-defined problem.
-
The GetReduce Primitive: The proposed GetReduce transaction (Section 4, page 5) is the most genuinely novel element of this paper. Standard memory fabrics operate on point-to-point load/store or get/put semantics (e.g., address -> data). By creating a new primitive that explicitly encodes both the source of the data and the destination of the reduction, the authors provide a clean and powerful mechanism for the network to become an active participant in the computation. This is a strong example of a hardware-software co-design where a new instruction/transaction enables significant downstream hardware optimizations.
-
Novel Synthesis for a Dynamic Problem: While the constituent hardware ideas (in-network caching and in-network reduction) are not new in isolation, their combination to address the input-dynamic Aggregation operator is. Prior work has extensively studied in-network reduction for static collectives like All-Reduce (e.g., Klenk et al., ISCA 2020 [22]; NVIDIA SHARP) and in-network caching for key-value stores (e.g., NetCache, SOSP 2017 [17]; Concordia, FAST 2021 [42]). This paper correctly identifies that neither approach alone is sufficient for Aggregation, which exhibits both input reuse (amenable to caching) and output reuse (amenable to reduction). The co-design of a switch pipeline that handles both is a novel contribution in this context.
-
Clear Problem Formulation: The paper does an excellent job of distinguishing the Aggregation problem from the well-studied All-Reduce problem (Figure 1, page 2). Highlighting the input-dependent communication pattern is key to justifying why a new solution is necessary, thereby carving out a clear space for a novel contribution.
Weaknesses
The novelty of the work is concentrated in the synthesis and the enabling primitive, not the individual components. The paper could be more precise about this positioning.
-
Constituent Mechanisms are Adaptations of Prior Art: The in-network reduction mechanism is conceptually an adaptation of prior work on collectives to a request-response model. The use of a reduction table with counters to manage dynamic arrivals is an engineering delta, but the core concept of stateful in-network aggregation is well-established. Similarly, the in-switch cache is functionally analogous to prior work on in-network caching for key-value stores. The paper's novelty would be stronger if it more explicitly framed these components as "novel adaptations" rather than implicitly presenting them as entirely new ideas. The innovation is in how they are adapted and combined, which should be the focus.
-
Understated Complexity of the GetReduce Primitive: The paper claims, "The only change in software is to re-implement the embedding layer..." (Section 3, page 5). This understates the novelty and challenge of introducing a new memory-semantic transaction. Supporting GetReduce would likely require changes to the GPU's memory controller, the network interface, the low-level driver, and possibly the compiler or programming model libraries (e.g., CUDA) to expose this functionality. While the abstraction is elegant, its implementation cost across the stack is non-trivial, and this aspect of the novel proposal is not fully explored.
Simplistic Coherence Model: The proposed solution for cache coherence is to invalidate the entire in-network cache at global synchronization boundaries (Section 5.3.2, page 8). While this is pragmatic for the evaluated DLRM training workloads, it is a known and simple technique. This approach limits the novelty of the caching system, as it sidesteps the more complex challenges of coherence that prior in-network caching systems (like Concordia [42]) have tackled. This design choice may constrain the architecture's applicability to workloads without frequent, cheap global synchronization points.
Questions to Address In Rebuttal
-
The novelty of the in-network reduction mechanism appears to be its adaptation to an asynchronous, input-dynamic request-response pattern. Could the authors please elaborate on the specific hardware differences in their reduction table and its state machine logic compared to the architecture proposed by Klenk et al. [22] for shared-memory collectives? Is the primary novelty the counter-based mechanism for handling an unknown number of responses?
-
The GetReduce primitive is a powerful and novel abstraction. However, its practical implementation seems to require deep cross-stack integration. Could the authors provide more detail on where this new semantic would be handled? For example, is this a new ISA instruction for the GPU, a specific library call that gets translated by the driver, or a new operation type handled exclusively by the network interface controller upon seeing a specific memory address pattern?
-
The combination of in-network caching and reduction in one switch is the core of the hardware novelty. Do these two features interact? For instance, does a request that will be part of a reduction (its OAddr matches a table entry) receive different treatment by the caching (IAddr) logic? Or are they entirely independent stages in the pipeline, as Figure 9 (page 8) seems to suggest?
FlexNeRFer: A Multi-Dataflow, Adaptive Sparsity-Aware Accelerator for On-Device NeRF Rendering
Abstract
Neural Radiance Fields (NeRF), an AI-driven approach for 3D view reconstruction, has demonstrated impressive performance, sparking active research across fields. As a result, a range of advanced NeRF models has emerged, leading on-device applications to ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present FlexNeRFer, a versatile hardware accelerator for on-device Neural Radiance Field (NeRF) rendering. The proposed architecture aims to efficiently support a diverse range of NeRF models by incorporating three key features: i) a flexible, hierarchical Network-on-Chip (NoC) for multi-dataflow support on a ii) precision-scalable MAC array, and iii) an online, adaptive data compression scheme that selects the optimal sparsity format based on runtime data characteristics. The authors claim significant speedup and energy efficiency improvements over a commodity GPU (NVIDIA RTX 2080 Ti) and a recent state-of-the-art NeRF accelerator (NeuRex).
While the paper addresses a relevant problem, the claims of superiority rest on a series of architectural choices whose novelty appears incremental and whose benefits are asserted through an evaluation methodology that lacks rigor, particularly concerning the baseline comparisons.
Strengths
-
Problem Formulation: The paper correctly identifies a critical gap in the field: the need for a single, flexible hardware architecture that can efficiently accelerate the growing diversity of NeRF models, moving beyond single-model point solutions. The motivation is clear and well-argued.
-
Workload Characterization: The initial analysis in Section 3 is sound. The breakdown in Figure 3 correctly identifies GEMM/GEMV operations and feature encoding as primary bottlenecks. The analysis in Figure 4, which illustrates the potential for MAC underutilization in existing dense accelerators (NVDLA, TPU) when handling irregular or sparse workloads common in NeRF, provides a solid foundation for the architectural requirements.
-
Sparsity Format Analysis: The investigation presented in Figure 7 (page 7), which demonstrates that the optimal compressed data format is dependent on both sparsity ratio and data precision, is a valuable piece of characterization. It provides a clear, data-driven justification for the paper's focus on an adaptive compression scheme.
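To illustrate why the best format can shift with both sparsity and precision, the sketch below compares simple first-order storage-cost models for a few common formats. The cost formulas and parameter values are generic textbook approximations chosen for illustration; they are not the paper's measured results or its actual format set.

```python
def compressed_bits(fmt, rows, cols, nnz, value_bits):
    """First-order storage-cost models (metadata + values) for one matrix tile."""
    row_idx = max(1, (rows - 1).bit_length())   # bits per row index
    col_idx = max(1, (cols - 1).bit_length())   # bits per column index
    ptr = max(1, nnz.bit_length())              # bits per column pointer
    if fmt == "dense":
        return rows * cols * value_bits
    if fmt == "bitmap":                          # 1 presence bit per element + values
        return rows * cols + nnz * value_bits
    if fmt == "coo":                             # (row, col) index pair per nonzero
        return nnz * (row_idx + col_idx + value_bits)
    if fmt == "csc":                             # row index per nonzero + column pointers
        return nnz * (row_idx + value_bits) + (cols + 1) * ptr
    raise ValueError(fmt)

def pick_format(rows, cols, nnz, value_bits):
    """Pick the cheapest representation for the observed sparsity and precision."""
    return min(("dense", "bitmap", "coo", "csc"),
               key=lambda f: compressed_bits(f, rows, cols, nnz, value_bits))

# Break-even points move with precision: for the same tile with ~84% nonzeros,
# staying dense wins at 4-bit values while bitmap compression wins at 16-bit,
# and a highly sparse tile favors index-based formats instead.
print(pick_format(256, 256, nnz=55000, value_bits=4))    # -> dense
print(pick_format(256, 256, nnz=55000, value_bits=16))   # -> bitmap
print(pick_format(256, 256, nnz=3000, value_bits=16))    # -> csc
```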
Weaknesses
-
Fundamentally Flawed GPU Comparison: The headline claims of 8.2–243.3× speedup and 24.1–520.3× energy efficiency improvement over an NVIDIA RTX 2080 Ti are almost certainly artifacts of an inequitable comparison. An ASIC designed for a specific domain will naturally outperform a general-purpose processor. However, the magnitude of these gains strongly suggests that the GPU baseline is not optimized to leverage the same sparsity that the proposed hardware exploits. The paper makes no mention of using sparsity-aware libraries (e.g., cuSPARSE) or custom kernels for the GPU. It appears the authors are comparing their specialized sparsity-aware hardware against naive, dense matrix operations on the GPU. This is an invalid comparison and renders the associated claims unsubstantiated.
-
Insufficient Justification for Hardware Overheads: The proposed architecture comes at a significant cost. According to the authors' own data in Figure 16 (page 12), FlexNeRFer is 55% larger (35.4 mm² vs. 22.8 mm²) and consumes 43% more power at INT16 (7.3 W vs. 5.1 W) than the NeuRex baseline. The justification for this substantial overhead is "flexibility." However, the evaluation fails to convincingly demonstrate that this flexibility yields a commensurate performance benefit. Figure 19 (page 13) aggregates results across models, obscuring the critical trade-offs. A rigorous evaluation would require a model-by-model breakdown, showing exactly how much performance NeuRex (optimized for Instant-NGP) loses on other models and how much FlexNeRFer gains, thereby justifying its PPA cost.
-
Incremental Novelty of Architectural Components: The core architectural contributions, upon inspection, appear to be incremental modifications of prior work rather than foundational innovations.
- MAC Array: The bit-scalable MAC unit, which is central to the design, is explicitly adopted from Bit Fusion [71].
- Network-on-Chip: The HMF-NoC is presented as a key feature, yet it is an extension of the HM-NoC from Eyeriss v2 [24], with the addition of a feedback path (Figure 9, page 7). The claimed 2.5× energy reduction for on-chip memory access compared to HM-NoC is not substantiated with a detailed analysis of the traffic patterns that would benefit from this feedback loop.
- Adaptive Sparsity: The online mechanism for choosing a sparsity format is a plausible engineering solution. However, the hardware for this (Figure 13(b), page 10) introduces its own latency and area overhead, which is not isolated or quantified. The critical claim that this online conversion leads to a "72% reduction in DRAM access time" (page 12) is provided without any supporting data, simulation traces, or methodological explanation.
-
Ambiguous Baseline and Metric Reporting: The paper includes a "bit-scalable SIGMA" in its MAC array comparison (Table 3, page 11). This appears to be a strawman baseline constructed by the authors, as its design and optimization choices are not detailed, making it impossible to verify the fairness of the comparison.
Questions to Address In Rebuttal
-
GPU Baseline: Please clarify the exact implementation of the GPU baseline. Specifically, for operations involving pruned or sparse matrices, were sparsity-aware libraries (e.g., cuSPARSE) or custom CUDA kernels that exploit sparsity utilized? If not, how can the comparison be considered fair? Please provide performance data against a sparsity-optimized GPU baseline.
-
Cost-Benefit Analysis vs. NeuRex: Provide a detailed, model-by-model performance and energy breakdown comparing FlexNeRFer and NeuRex. How does FlexNeRFer perform on Instant-NGP (NeuRex's target workload)? Conversely, what is the performance degradation of NeuRex on models like IBRNet, Mip-NeRF, and vanilla NeRF that would justify FlexNeRFer's 55% area and 43% power overhead?
-
Quantification of Online Sparsity Overhead: What are the precise area, power, and latency overheads of the online sparsity format selection logic (i.e., the popcount-based SR calculator and flexible encoder)? How was this overhead factored into the total system results reported in Figures 16 and 19?
-
Substantiation of DRAM Access Reduction: The claim of a "72% reduction in DRAM access time" (page 12) is a cornerstone of the argument for adaptive compression. Please provide the methodology and data (e.g., from simulation traces) that support this specific figure.
-
"Bit-scalable SIGMA" Baseline: Please provide the detailed microarchitecture and design parameters for the "bit-scalable SIGMA" baseline used in Table 3. Without this information, the claims of superior efficiency are unverifiable.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents FlexNeRFer, a hardware accelerator architecture designed for on-device Neural Radiance Field (NeRF) rendering. The authors argue that as the NeRF landscape diversifies, existing hardware solutions like GPUs are too inefficient (in terms of power, performance, and area) and specialized NeRF accelerators (e.g., NeuRex) lack the versatility to handle the wide range of emerging models.
The core contribution is an architecture that achieves flexibility through the co-design of three key components: 1) a hierarchical, multi-dataflow Network-on-Chip (NoC) built upon a precision-scalable MAC array, enabling efficient mapping of diverse and sparse workloads; 2) an optimized reduction tree for aggregating results efficiently; and 3) a novel adaptive sparsity-aware data compression scheme. A key insight driving this third component is the authors' analysis showing that the optimal compressed data format (e.g., COO, CSC, Bitmap) depends not only on the sparsity ratio but also on the data's numerical precision. FlexNeRFer dynamically determines the best format for input data in real-time to minimize memory footprint and data movement.
Evaluated using a 28nm process, FlexNeRFer is shown to significantly outperform both a high-end GPU (NVIDIA RTX 2080 Ti) and the state-of-the-art NeuRex accelerator across a suite of seven representative NeRF models, demonstrating substantial gains in speed and energy efficiency.
Strengths
This is a well-motivated and thoughtfully designed piece of work that makes a significant contribution to the field of domain-specific acceleration for neural graphics. Its primary strengths are:
-
Excellent Problem Formulation and Contextualization: The paper does an outstanding job of positioning itself within the current research landscape. Instead of narrowly focusing on accelerating a single algorithm, it correctly identifies the "Cambrian explosion" of NeRF models as a critical challenge for hardware design. The analysis in Section 3, particularly Figure 4 (page 5), which illustrates the inefficiency of existing DNN accelerators on NeRF workloads, provides a compelling motivation for a new, more flexible approach. The brief but insightful comparison to LLM acceleration challenges (Section 2.1.2, page 3) further demonstrates a mature understanding of the broader accelerator domain.
-
Novel Synthesis of Architectural Concepts: While individual components like bit-scalable MAC arrays or flexible dataflow interconnects have been explored in the general DNN accelerator literature (e.g., Bit Fusion, SIGMA), the strength of this work lies in their intelligent synthesis. The design is holistically driven by the specific demands of the NeRF domain. The standout contribution here is the adaptive sparsity format selection (Section 4.3, page 10). The analysis in Section 3.2.3 (page 6, Figures 7 & 8) that the optimal format is a function of both sparsity and precision is a crisp, valuable insight that directly translates into an effective architectural feature. This moves beyond static compression schemes and creates a truly dynamic and adaptive system.
-
High Potential for Impact and Longevity: By prioritizing flexibility, the FlexNeRFer architecture is not brittle; it is designed to outlive any single NeRF algorithm. As neural rendering techniques continue to evolve—perhaps incorporating different network backbones or new primitives—an architecture that can efficiently handle irregular dataflows, dynamic sparsity, and variable precision is exactly what will be needed. This work provides a strong blueprint for future "neural graphics processors" and could influence commercial designs for AR/VR and mobile SOCs.
Weaknesses
The work is strong, and the weaknesses are relatively minor in the face of its core contribution. They are primarily areas where the exploration could be deepened.
-
Limited Exploration of the Software/Compiler Stack: The paper focuses heavily on the hardware architecture. While the hardware is designed to be flexible, it's not entirely clear how this flexibility would be programmed or controlled. The RISC-V controller is mentioned, but the paper would benefit from a discussion of the software/compiler challenges. For example, how are different dataflows for various NeRF models mapped onto the hardware? Is this done manually, or could a compiler automatically determine the optimal dataflow and sparsity settings for a given layer?
-
Evaluation Scope Could Be Broader: The paper rightly identifies that the NeRF field is diversifying. While the seven models chosen for evaluation are representative, the neural rendering field is already moving towards other primitives, most notably 3D Gaussian Splatting. Given that Gaussian Splatting also involves processing sparse, unstructured data, it would be a powerful demonstration of FlexNeRFer's "versatility" to include an analysis of how well the architecture maps to this emerging class of rendering algorithms. This is more of a suggestion for future work than a flaw, but it would have elevated the paper's impact even further.
Questions to Address In Rebuttal
-
Regarding the adaptive sparsity format mechanism (Section 4.3, page 10): The system calculates the sparsity ratio of input activations on-the-fly and encodes them. Could you elaborate on the latency and energy overhead of this "Sparsity-Ratio-Calculation -> Format-Selection -> Encoding" pipeline? Figure 18a shows the latency of "Format Dec./Enc.", but how does the real-time calculation and decision-making process itself impact the critical path and overall cycle time?
-
The paper's core premise is flexibility to support diverse models. How would the proposed architecture handle a NeRF model that relies more heavily on a Transformer backbone (as mentioned in the introduction)? These models introduce large-scale Softmax and LayerNorm operations, which are not explicitly addressed. Does the GEMM/GEMV unit's flexibility extend to accelerating these components, or would they become new bottlenecks on the proposed system?
-
Could you provide more intuition on the trade-offs of the hierarchical HMF-NoC? Specifically, what is the area/power cost of the additional feedback paths and 3x3 switches compared to the HM-NoC in Eyeriss v2, and how did you balance this against the energy savings from reduced memory access that you report?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present FlexNeRFer, a hardware accelerator for on-device Neural Radiance Field (NeRF) rendering. The core thesis is that existing accelerators are too specialized for specific NeRF models, while GPUs are too inefficient. The proposed solution is a versatile architecture designed to handle a diverse range of NeRF workloads, particularly those involving varying levels of sparsity and numerical precision. The authors claim novelty in two main areas: 1) a flexible Network-on-Chip (NoC) designed for a precision-scalable MAC array that supports multiple dataflows and sparsity, and 2) a dynamic data compression scheme that selects an optimal sparsity format in real-time based on the data's sparsity ratio and precision.
While the paper presents a very well-engineered system with impressive results, its primary contributions are more a clever and powerful synthesis of existing architectural concepts than the introduction of fundamentally new ones. The most significant novel element is the online, hardware-managed selection of data compression formats.
Strengths
-
Online Adaptive Sparsity Format Selection: The paper's most salient novel contribution is the mechanism for dynamic, sparsity-aware data compression (Section 4.3, page 10). The observation that the optimal format (COO, CSC/CSR, Bitmap) depends on both sparsity ratio and bit-width (Figure 7, page 7) is well-motivated. The proposed hardware mechanism to calculate the sparsity ratio of input data on a per-tile basis in real-time and then select the optimal format (Figure 13, page 10) is a genuinely new approach for accelerator design. Most prior art commits to one or two formats at design time or compile time; this work proposes to make this decision dynamically in hardware, which is a significant conceptual step.
-
Thorough Problem-Space Analysis: The authors perform a commendable analysis of NeRF model bottlenecks (Section 3.1, page 5) and the limitations of existing dense accelerators when handling irregular and sparse workloads (Section 3.2.1, page 5). This analysis provides a strong justification for the architectural choices made.
Weaknesses
My critique is centered exclusively on the degree of novelty of the core architectural components, setting aside the application to NeRF.
-
The "Flexible Network-on-Chip (HMF-NoC)" is an Incremental Extension of Prior Art: The proposed distribution network is heavily based on existing work. The authors themselves state that their Hierarchical Mesh with Feedback NoC (HMF-NoC) is an "extended NoC based on HM-NoC introduced in [24]" (Eyeriss v2). The described extensions—adding a feedback loop and modifying the switch from 2x2 to 3x3—are incremental engineering improvements to adapt the existing topology to their specific data movement needs. The core concept of a hierarchical, reconfigurable mesh for supporting diverse dataflows (unicast, multicast, broadcast) in a spatial accelerator is the central contribution of Eyeriss v2. Therefore, claiming this NoC as a key novel component is an overstatement; it is a skillful adaptation.
-
The Reduction Tree (RT) is a Synthesis of Known Techniques: The design of the reduction tree (Section 4.2, page 10) is presented as a key feature. However, the use of comparators and bypassable adders to flexibly manage partial sums from sparse inputs is a known technique. This concept is functionally identical to the reduction networks in accelerators like Flexagon [51] and Trapezoid [93], which were designed to handle sparsity by matching indices before accumulation (a toy sketch of this index-matching pattern is given after this list). The novelty here is limited to the optimization of shifters within the specific context of the bit-scalable MAC unit from Bit Fusion [71], which is a valuable but localized engineering optimization, not a new architectural paradigm.
-
Complexity vs. True Novelty: The architecture integrates a bit-scalable MAC array [71], a hierarchical NoC [24], and a flexible reduction tree [51, 93] with a novel dynamic format selector. The resulting system is highly complex. While the performance gains are significant, the novelty must be weighed against this complexity. The core architectural "inventions" are limited, while the engineering effort in integrating and optimizing these disparate known concepts is immense. The paper presents itself as offering a novel architecture, when it is more accurately described as a novel and complex integration of architectures.
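As flagged in the reduction-tree item above, here is a toy software sketch of index-matched accumulation: two index-sorted partial-sum streams are merged, with an adder firing only when indices match and entries bypassed otherwise. It is a generic illustration of the technique attributed to Flexagon/Trapezoid-style reduction networks, under invented function names, not a model of FlexNeRFer's RT microarchitecture.

```python
def index_matched_reduce(stream_a, stream_b):
    """Merge two index-sorted streams of (col_index, partial_sum) pairs.

    When indices match, the partial sums are accumulated; otherwise the entry
    is bypassed unchanged, which is the software analogue of a comparator
    feeding a bypassable adder stage in a sparse reduction tree.
    """
    out, i, j = [], 0, 0
    while i < len(stream_a) and j < len(stream_b):
        ia, va = stream_a[i]
        ib, vb = stream_b[j]
        if ia == ib:              # indices match: accumulate
            out.append((ia, va + vb)); i += 1; j += 1
        elif ia < ib:             # no match: bypass the entry with the smaller index
            out.append((ia, va)); i += 1
        else:
            out.append((ib, vb)); j += 1
    out.extend(stream_a[i:])
    out.extend(stream_b[j:])
    return out

# Partial sums from two sparse rows being reduced into one output row.
print(index_matched_reduce([(0, 1.0), (3, 2.0)], [(3, 0.5), (7, 4.0)]))
# -> [(0, 1.0), (3, 2.5), (7, 4.0)]
```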
Questions to Address In Rebuttal
-
On the HMF-NoC: Please elaborate on the fundamental architectural novelty of the HMF-NoC beyond the extensions (feedback path, 3x3 switch) to the HM-NoC from Eyeriss v2 [24]. What new capabilities does this enable that were fundamentally impossible or intractable with the original HM-NoC or other flexible NoCs like the one in FEATHER [84]?
-
On the Overhead of Dynamic Compression: The core novel idea is the online sparsity format selection. What is the latency and area/power overhead of the hardware block responsible for this? Specifically, how many cycles are required to: a) fetch a tile, b) calculate its sparsity ratio via popcounting, c) consult the logic to select the optimal format, and d) encode the tile into that format? Is this overhead fully accounted for in the end-to-end latency results presented in Figure 18?
-
Justification for Multi-Format Support: The adaptive format selection adds significant control complexity. Could a simpler system that supports only two formats (e.g., CSC/CSR for moderately sparse data and Bitmap for extremely sparse data) achieve, for instance, >90% of the performance and efficiency benefits of your fully adaptive three-format system? A comparison against such a baseline would help quantify the marginal benefit gained from the added complexity of supporting and dynamically choosing from all formats.
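To make the decision logic probed by these questions concrete, the sketch below derives a tile's sparsity ratio from a popcount over its nonzero mask and then picks the cheapest of several candidate encodings under a simplified bit-cost model (including a stay-dense option). The cost model, candidate set, and tile shapes are my own illustrative assumptions rather than the paper's encoder; the point is only that the winning format can flip with both sparsity and bit-width, which is the behavior a reduced two-format baseline would have to approximate.

```python
import math

def encoded_bits(tile, bw):
    """Storage cost in bits of a 2-D tile under several candidate formats,
    for element bit-width `bw`. Deliberately simplified cost model."""
    rows, cols = len(tile), len(tile[0])
    n = rows * cols
    nnz = sum(1 for row in tile for v in row if v != 0)   # popcount over the nonzero mask
    idx = max(1, math.ceil(math.log2(max(rows, cols))))   # bits per row/col index
    return {
        "dense":  n * bw,
        "bitmap": n + nnz * bw,                            # 1-bit mask + packed nonzeros
        "csr":    nnz * (idx + bw) + (rows + 1) * idx,
        "coo":    nnz * (2 * idx + bw),
    }

def select_format(tile, bw):
    sizes = encoded_bits(tile, bw)
    return min(sizes, key=sizes.get), sizes

def make_tile(rows, cols, nnz):
    flat = [1] * nnz + [0] * (rows * cols - nnz)
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

moderate = make_tile(8, 8, 52)   # ~19% of entries are zero
extreme  = make_tile(8, 8, 8)    # ~88% of entries are zero
for bw in (4, 8, 16):
    print(bw, select_format(moderate, bw)[0], select_format(extreme, bw)[0])
# 4  -> dense  coo : at INT4, compressing the moderately sparse tile does not pay off
# 8  -> bitmap coo : at INT8 and above the same tile flips to bitmap, i.e. the best
# 16 -> bitmap coo   format depends on both the sparsity ratio and the precision
```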
BingoGCN: Towards Scalable and Efficient GNN Acceleration with Fine-Grained Partitioning and SLT
Abstract
Graph Neural Networks (GNNs) are increasingly popular due to their wide applicability to tasks requiring the understanding of unstructured graph data, such as those in social network analysis and autonomous driving. However, real-time, large-scale GNN ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper proposes BingoGCN, a GNN acceleration framework designed to mitigate the memory-bound nature of GNN inference on large graphs. The authors introduce two primary techniques: 1) Cross-Partition Message Quantization (CMQ), a method based on online vector quantization to summarize inter-partition node features, thereby aiming to eliminate irregular off-chip memory accesses during message passing. 2) A fine-grained (FG) structured pruning approach for the Strong Lottery Ticket (SLT) theory, intended to shift the performance bottleneck from memory to computation and then efficiently handle that computation. The authors present an FPGA-based hardware architecture to support this flow and evaluate it on several graph datasets.
However, the paper suffers from overstated claims, methodological weaknesses in its evaluation, and questionable novelty of its core contributions. While the problem is significant, the proposed solutions are not as robustly proven as the authors assert.
Strengths
- The paper addresses the critical and well-known challenge of memory-bound GNN inference on large graphs, which is a significant bottleneck for practical deployment.
- The motivation is well-established. The analysis in Section 1 and Figures 1 & 2 correctly identifies the trade-off in graph partitioning: finer partitions improve locality but exacerbate the problem of inter-partition communication.
- The core concept of applying online vector quantization to summarize inter-partition messages (CMQ) is an interesting direction to explore for managing irregular access patterns.
- The work is well-structured, and the proposed architecture is described in detail in Section 4, providing a clear picture of the proposed hardware implementation.
Weaknesses
-
Overstated Claim of "No Accuracy Loss" for CMQ: The abstract makes the bold claim that CMQ "eliminates irregular off-chip memory access without additional training and accuracy loss." This is a significant overstatement. CMQ is based on vector quantization, which is an inherently lossy compression technique. It replaces a set of unique node feature vectors with a single, shared centroid vector. This process fundamentally loses information. The authors' own results in Figure 15 (page 10) show a clear, albeit small, accuracy gap between their CMQ approach ("Ours (CMQ)") and the "DWL Models (Baseline)" for both OGBN-Arxiv and Reddit datasets. The claim of "no accuracy loss" is therefore contradicted by the paper's own data. The correct description is that it achieves comparable accuracy with minimal loss, which is a much weaker and more standard claim for a quantization method.
-
Questionable Novelty of FG-Structured SLT: The paper presents its "fine-grained (FG) structured pruning" for SLT as a novel contribution (Abstract, Section 1). However, the technique described in Section 3.2 appears to be a straightforward application of N:M block sparsity (which they cite as Density Bound Block [38, 56]) to the supermasks generated by the SLT framework. While combining these two existing ideas may be effective, presenting it as a novel pruning algorithm or a fundamental extension to SLT theory is a stretch. It appears to be an engineering adaptation rather than a novel scientific contribution.
-
Fundamentally Flawed SOTA Comparison: The comparison against other SOTA accelerators in Table 3 (page 12) is methodologically unsound. As stated in Section 5.4.5 (page 13), this comparison uses "a two-layer GCN with an embedding dimension of 16." This is a toy model. The main experiments in the rest of the paper use a much more realistic hidden dimension of 192 (Section 5.1.1, page 9). Using an extremely small model for the SOTA comparison invalidates the results. Competing accelerators (e.g., FlowGNN, I-GCN) may be architected and optimized for realistic workloads, and their performance characteristics do not necessarily scale linearly down to such a small embedding size. This choice appears designed to artificially inflate the "Resource Normalized Latency" speedup of the proposed work. A valid comparison must be conducted using the same realistic model configurations.
-
Insufficient Analysis of CMQ Overheads and Scalability: The paper claims CMQ shifts the bottleneck to computation, but it fails to rigorously analyze the overhead of CMQ itself. The CMQ engine (Section 4.5) performs distance calculations, codebook queries, and online updates. These operations consume computational resources and time. While the paper claims this is pipelined, it does not quantify this overhead or analyze how it scales. For truly massive graphs with a huge number of inter-partition nodes, the number of centroids required (even at a 1% ratio) and the size of the codebooks could become substantial, introducing a new bottleneck. The paper provides no analysis to assuage this concern.
-
Potential Simplification of the SLT Bottleneck: The narrative that SLT cleanly solves the new compute bottleneck is oversimplified. While SLT reduces MAC operations, their FG-structured sparse approach introduces new complexities. The PE (Figure 12, page 8) must handle a compressed mask format (Mask Value and Mask Index), which requires extra decoding logic and potentially irregular access to the on-chip supermask buffer. This control overhead is non-trivial and is not adequately contrasted with the data-path savings.
Questions to Address In Rebuttal
- Please justify the claim of "no accuracy loss" in the abstract, given that CMQ is an approximation method based on VQ and the results in Figure 15 show a performance gap to the baseline. We expect authors to use precise language; "comparable" is not the same as "no loss."
- Can the authors clarify the novelty of "FG structured pruning with SLT" beyond the application of existing block sparsity techniques (e.g., N:M sparsity) to the SLT framework? What is the fundamental algorithmic contribution here?
- Please defend the choice of a 16-dimensional embedding GCN for the SOTA comparison in Table 3. How would the results change if a more realistic, larger model (e.g., 192-dim as used elsewhere in the paper) was used, for which competing accelerators may be better optimized? The current comparison is not credible.
- What is the computational and memory overhead of the CMQ unit itself (distance calculations, codebook storage, updates), and how do you project this to scale to graphs significantly larger than Reddit, where the absolute number of inter-partition nodes could become prohibitive?
- The parallel RNG design in Section 4.3 (page 9) uses a heuristic for seed generation to enable parallel processing. Has any statistical analysis (e.g., TestU01, Dieharder) been performed to ensure that this approach does not introduce correlations or degrade the randomness quality required for the SLT theory to hold?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces BingoGCN, a GNN acceleration framework designed to address the critical memory communication bottleneck in large-scale GNN inference. The authors identify a fundamental dilemma in existing partitioning-based accelerators: increasing the number of partitions improves data locality but explodes the number of inter-partition edge traversals, leading to more irregular off-chip memory access.
The core contribution is a synergistic, two-part solution to this problem. First, it proposes Cross-Partition Message Quantization (CMQ), a novel technique that uses online vector quantization to create compact summaries (centroids) of inter-partition node features. This elegantly transforms the problem of irregular, high-bandwidth memory fetches into a manageable on-chip computation and lookup task, thereby enabling aggressive, fine-grained partitioning without performance degradation. Second, having shifted the system bottleneck from memory to computation, the framework incorporates the Strong Lottery Ticket (SLT) hypothesis to drastically reduce the computational workload. Crucially, the authors adapt SLT for hardware by introducing a Fine-Grained (FG) structured pruning method, which overcomes the challenge of SLT's inherent unstructured sparsity and allows for efficient, load-balanced execution on a PE array with on-the-fly weight generation. The authors validate their approach with an FPGA-based implementation, demonstrating significant performance and energy efficiency gains over state-of-the-art GNN accelerators, particularly on large graph datasets.
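As a reference point for what online vector quantization of boundary features involves, here is a minimal sketch of nearest-centroid assignment with a running-mean centroid update, i.e., a generic online k-means fragment. The class, parameters, and update rule are invented for illustration and do not reproduce BingoGCN's CMQ engine or its hierarchical two-level variant.

```python
import numpy as np

class OnlineVQ:
    """Minimal online vector quantizer: remote node features are replaced by
    their nearest centroid, and centroids drift toward the features they absorb."""

    def __init__(self, num_centroids, dim, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.centroids = rng.standard_normal((num_centroids, dim)).astype(np.float32)
        self.lr = lr

    def quantize(self, feature, update=True):
        d = np.linalg.norm(self.centroids - feature, axis=1)   # distance calculation
        k = int(np.argmin(d))
        if update:  # running-mean style online update of the winning centroid
            self.centroids[k] += self.lr * (feature - self.centroids[k])
        return k                      # codebook index stored per boundary node

    def lookup(self, k):
        return self.centroids[k]      # proxy feature used during aggregation

# Usage: aggregate a partition's messages using centroid proxies for remote nodes.
vq = OnlineVQ(num_centroids=16, dim=8)
remote_features = np.random.default_rng(1).standard_normal((100, 8)).astype(np.float32)
codes = [vq.quantize(f) for f in remote_features]        # codebook maintenance
approx_msgs = np.stack([vq.lookup(c) for c in codes])    # on-chip lookups at aggregation time
print(approx_msgs.shape)   # (100, 8)
```

The split between `quantize` (codebook maintenance) and `lookup` (aggregation-time reads) also marks where the overhead question below arises: the former is the part whose cost must be hidden by pipelining.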
Strengths
This work stands out for its insightful system-level perspective and its elegant synthesis of ideas from disparate domains.
-
A Novel and Elegant Solution to a Foundational Problem: The challenge of inter-partition communication in graph processing is a long-standing one. The typical solutions involve either limiting partition granularity, duplicating data, or using sophisticated sampling schemes (as in BNS-GCN). The introduction of CMQ is a genuinely novel approach that reframes the problem. Instead of fetching exact data, it fetches an approximate, quantized representation. This connection to vector quantization—a technique more commonly seen in areas like approximate nearest neighbor search or data compression—is a creative and powerful intellectual leap. It provides a principled way to manage the information-performance trade-off at partition boundaries.
-
Excellent Synergistic Co-design: The true strength of this paper is not just in its individual components, but in how they are combined. The authors correctly identify that solving the memory access problem with CMQ makes computation the new bottleneck. This foresight motivates the integration of SLT. This is a hallmark of strong systems research: one solution is used to create a new, more tractable problem, which is then solved by a second, complementary solution. The result is a holistic framework where fine-grained partitioning, message quantization, and computational pruning work in concert.
-
Bridging Advanced ML Theory and Hardware Reality: Both SLT and VQ are powerful theoretical concepts, but their direct application in hardware is non-trivial. This paper does an excellent job of bridging that gap. The development of FG structured pruning (Section 3.2, page 6) is a key contribution in its own right. It takes the unstructured, hardware-unfriendly sparsity of SLT and imposes a structure that enables load-balancing and efficient weight generation. This demonstrates a deep understanding of both the machine learning algorithm and the underlying hardware constraints.
-
Strong and Convincing Evaluation: The experimental results are thorough and well-presented. The authors effectively isolate the benefits of their contributions, for instance, in Figure 21 (page 12), which quantifies the separate impacts of CMQ and SLT. The robustness analysis in Figures 15 and 16 (page 10) convincingly shows that CMQ maintains high accuracy even with high partition counts and low centroid ratios. The performance comparison against other accelerators, particularly the scaling result for BingoGCN(D) on the Reddit dataset (Figure 20, page 12), powerfully illustrates their central claim: by solving the memory bottleneck, their architecture becomes compute-bound and thus scales much more effectively with additional resources.
Weaknesses
The weaknesses of the paper are primarily related to the boundaries of its claims and opportunities for deeper exploration, rather than fundamental flaws in the core idea.
-
Limited Exploration of CMQ's Generalizability: The CMQ concept is demonstrated effectively for GCNs. However, the GNN landscape is vast, including models like Graph Attention Networks (GATs) where messages are not just aggregated features but are weighted based on attention scores. It is unclear how a pre-computed centroid could participate in an attention mechanism. The paper would be strengthened by a discussion of the applicability or necessary adaptations of CMQ for other major GNN families.
-
Potential Overheads of Online Clustering: While the paper claims the CMQ update process can be pipelined and its overhead hidden, the computational cost of online clustering is non-zero. The hierarchical approach (Section 3.1, page 5) mitigates this, but in scenarios with an extremely high number of inter-partition nodes and dynamic graphs, this update logic could become a bottleneck itself. A more detailed analysis of the CMQ unit's latency and resource cost as a function of inter-partition graph statistics would add further credibility.
-
Focus on Inference: The work is explicitly framed for GNN inference. The SLT theory, however, originated from finding sparse trainable subnetworks. While inference is a critical and valid target, the paper misses an opportunity to connect with the broader context of the full GNN lifecycle. A brief discussion on the challenges and potential of extending this architecture to support on-device or federated training scenarios would broaden its impact.
Questions to Address In Rebuttal
-
The core idea of CMQ relies on summarizing feature vectors. How would this approach extend to more complex message-passing schemes, such as in GATs, where pairwise interactions between specific source and destination nodes are needed to compute attention weights? Does using a single centroid for multiple nodes break such mechanisms?
-
Regarding the online CMQ update, how does the system's performance scale with the "boundary-to-interior" node ratio of a graph partition? Is there a tipping point where the computational cost of calculating distances and updating centroids (even with the hierarchical approach) ceases to be negligible compared to the GNN's primary computation?
-
The quality of the random weights generated for the SLT mechanism is critical for model accuracy. You use Xorshift16 for its hardware efficiency (Section 4.3, page 9). Given that SLT results can be sensitive to the initial weight distribution, have you validated that this PRNG's statistical properties are sufficient for more complex tasks or deeper models compared to more robust, but costlier, random number generators?
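For context on this last question, a 16-bit xorshift generator costs only a few XORs and shifts per draw, which is what makes it attractive for on-the-fly weight generation. The sketch below uses the common (7, 9, 8) shift triplet and an illustrative mapping from state to weight; the paper's exact parameters and per-PE seed heuristic are not reproduced here and may differ.

```python
def xorshift16(seed):
    """Stream of 16-bit xorshift states. The (7, 9, 8) shift triplet is a
    commonly used full-period choice and is an assumption here."""
    x = seed & 0xFFFF
    assert x != 0, "xorshift state must be nonzero"
    while True:
        x ^= (x << 7) & 0xFFFF
        x ^= (x >> 9)
        x ^= (x << 8) & 0xFFFF
        yield x

def random_weight(stream, scale=1.0 / 128):
    """Illustrative mapping from a 16-bit state to a small signed weight,
    as used for SLT-style on-the-fly weight generation (not the paper's mapping)."""
    v = next(stream)
    return ((v & 0xFF) - 128) * scale   # roughly uniform in [-1, 1)

gen = xorshift16(seed=0xACE1)
print([random_weight(gen) for _ in range(4)])
```

Correlations between parallel streams seeded from a simple heuristic are exactly what statistical batteries such as Dieharder or TestU01 would be expected to expose.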
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper proposes BingoGCN, a hardware acceleration framework for Graph Neural Networks (GNNs) designed to address the memory access bottleneck in large-scale inference. The authors' claims to novelty are centered on two primary techniques:
-
Cross-Partition Message Quantization (CMQ): A method to handle inter-partition communication in partitioned graphs. Instead of fetching remote node features from off-chip memory—an irregular and costly operation—CMQ performs online vector quantization on these features. It creates a compact, on-chip codebook of "centroids" that serve as proxies for the actual features, thereby converting irregular off-chip accesses into regular on-chip lookups.
-
Fine-Grained Structured Pruning for Strong Lottery Tickets (FG-SLT): An adaptation of the Strong Lottery Ticket (SLT) theory for GNNs. The authors address the hardware-unfriendly nature of SLT's unstructured sparsity by integrating it with a fine-grained (FG) structured pruning method. This enforces a fixed number of non-zero elements within small blocks of the weight matrix, aiming to improve PE utilization and load balancing during on-chip weight generation.
The paper presents an FPGA-based architecture incorporating these ideas and demonstrates significant performance and energy efficiency improvements over existing accelerators.
Strengths
The primary strength of this work lies in the clever synthesis of existing, disparate concepts to solve a well-defined and critical problem in GNN acceleration.
-
Novel Application of VQ for Inter-Partition Communication: The core idea behind CMQ is a novel application of a known technique. While graph partitioning is standard (e.g., METIS) and vector quantization is a classic compression algorithm, their combined use to dynamically create on-chip summaries of remote node features is, to my knowledge, new in this specific context. It directly targets the inter-partition edge problem, which, as the authors correctly identify in Figures 1 and 2, is a major scalability limiter for fine-grained partitioning. This is a creative engineering solution.
-
Pragmatic Enhancement to SLT Theory for Hardware: The SLT theory is powerful but its resulting unstructured sparsity is a known challenge for hardware designers. The authors' contribution is to merge the SLT concept (randomly initialized weights pruned by a supermask) with the hardware-friendly constraint of structured sparsity (what they call FG, akin to N:M sparsity). The algorithm proposed in Section 3.2 (page 6) to combine multi-coated supermasks (MSup) with FG constraints appears to be a novel, albeit incremental, algorithmic contribution that makes SLT more viable for hardware acceleration.
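To make the fine-grained structured constraint concrete, the sketch below selects a supermask by keeping the top-k learned scores inside every small block of a frozen random weight tile, which is the N:M-style behavior described above. Shapes, block size, and function names are illustrative assumptions, not the authors' training procedure.

```python
import numpy as np

def fg_structured_supermask(scores, block=8, keep=2):
    """Binary supermask that keeps the `keep` highest-scoring entries in every
    contiguous block of `block` weights along the last axis (N:M-style sparsity).
    The random weights themselves are never trained; only the mask is selected.
    Assumes the score matrix size is divisible by `block`."""
    flat = scores.reshape(-1, block)
    order = np.argsort(flat, axis=1)                      # ascending scores per block
    mask = np.zeros_like(flat, dtype=bool)
    np.put_along_axis(mask, order[:, -keep:], True, axis=1)
    return mask.reshape(scores.shape)

rng = np.random.default_rng(0)
scores = rng.standard_normal((4, 16))       # learned scores for a 4x16 weight tile
weights = rng.standard_normal((4, 16))      # frozen random weights (SLT setting)
mask = fg_structured_supermask(scores, block=8, keep=2)
effective = weights * mask                  # every block of 8 keeps exactly 2 weights
print(mask.sum(axis=1))                     # -> [4 4 4 4]: balanced nonzeros per row
```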
Weaknesses
My critique is focused exclusively on the degree of novelty. While the synthesis is novel, the fundamental building blocks are well-established, and the paper could do more to situate its contributions within a broader history of these ideas.
-
Constituent Components are Not Fundamentally New: The central weakness from a novelty perspective is that the paper does not invent new primitives.
- CMQ is, at its core, an application of online k-means clustering. Hierarchical clustering to accelerate search (as seen in Figure 5b) is also a standard technique in the VQ and ANNS literature (e.g., hierarchical k-means, tree-based VQ). The novelty is purely in its application, not in the algorithm itself.
- FG-SLT is a combination of SLT [25, 52] and fine-grained structured sparsity, an idea heavily explored in works on model compression and efficient hardware design, such as NVIDIA's N:M sparsity and the cited DBB [38, 56]. The contribution is the successful merger, not the invention of either part.
-
Conceptual Overlap with Communication-Avoiding Algorithms: The CMQ idea bears a strong conceptual resemblance to decades of work in High-Performance Computing (HPC) and distributed systems on communication-avoiding algorithms. The principle of creating local, compressed "summaries" or "sketches" of remote data to reduce expensive communication is a classic pattern. While the specific mechanism (online VQ for GNN features) is new, the work would be stronger if it acknowledged and differentiated itself from this broader class of solutions.
-
The "Delta" is Primarily Application-Engineering: The primary contribution is identifying that technique 'A' (online VQ) and technique 'B' (SLT with structured sparsity) are effective solutions for problem 'C' (GNN acceleration bottlenecks). While this is valuable, the conceptual leap is not as significant as the invention of a new algorithmic primitive. The novelty is in the what and where of the application, not the how of the underlying mechanism.
Questions to Address In Rebuttal
-
Regarding CMQ: Could the authors elaborate on the novelty of CMQ beyond the direct application of online VQ to inter-partition GNN nodes? How does this method fundamentally differ from classic data sketching or lossy compression techniques used to reduce communication in other distributed computing domains?
-
Hierarchical CMQ: The two-level hierarchical structure for CMQ (Section 3.1, Figure 5b, page 5) is an important part of the implementation. Is this hierarchical approach itself a novel contribution, or is it an adaptation of well-known methods like hierarchical k-means or product quantization trees used to accelerate nearest centroid search?
-
Regarding FG-SLT: The core novelty claim is the combination of MSup [52] with FG structured pruning. Does the training algorithm fundamentally change how the supermasks are learned, or does it simply impose a structural constraint on the final mask selection (i.e., for each block, pick the top-k scores that satisfy the sparsity constraint)? Please clarify the precise algorithmic novelty beyond combining the two constraints.
-
Complexity vs. Benefit Trade-off: CMQ introduces non-trivial architectural complexity, including online clustering units, multiple codebook buffers, and update logic. Could the authors provide a brief analysis of whether there are graph structures or scenarios (e.g., graphs with very weak feature locality) where the overhead of maintaining the CMQ codebooks could potentially outweigh the benefits of reduced off-chip memory traffic?
LightNobel: Improving Sequence Length Limitation in Protein Structure Prediction Model via Adaptive Activation Quantization
Abstract
Recent advances in Protein Structure Prediction Models (PPMs), such as AlphaFold2 and ESMFold, have revolutionized computational biology by achieving unprecedented accuracy in predicting three-dimensional protein folding structures. However, these models ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present LightNobel, a hardware-software co-design aimed at mitigating the scalability limitations of Protein Structure Prediction Models (PPMs) when processing long amino acid sequences. The core of their contribution is a software method, Token-wise Adaptive Activation Quantization (AAQ), which leverages observed statistical properties of PPM activations to apply a fine-grained, multi-precision quantization scheme. This is coupled with a proposed hardware accelerator featuring a Reconfigurable Matrix Processing Unit (RMPU) and a Versatile Vector Processing Unit (VVPU) designed to execute AAQ efficiently. The paper claims substantial improvements in speed (up to 8.44x), power efficiency (up to 43.35x), and a dramatic reduction in peak memory (up to 120.05x) over modern GPUs, with what is described as "negligible" accuracy loss. My analysis questions the validity of these claims based on the chosen baselines and evaluation methodology.
Strengths
-
Problem Identification and Motivation: The paper correctly identifies a critical bottleneck in modern PPMs. The analysis in Section 3, particularly Figures 3 and 4 (pages 3-4), effectively demonstrates that the Pair Representation dataflow and the associated quadratic growth in activation size are the primary impediments to scaling for long sequences. This motivation is clear and well-established.
-
Core Algorithmic Insight: The empirical analysis supporting the proposed AAQ method appears sound. The observation presented in Section 3.3 and Figure 5 (page 4)—that PPM activations exhibit high variance between tokens but relatively low variance between channels—is a non-trivial insight. This provides a solid rationale for choosing a token-wise over a channel-wise quantization strategy.
-
Systematic Quantization Design: The design space exploration for the AAQ quantization scheme in Section 7.1 and Figure 11 (page 10) is thorough. The authors provide a clear, data-driven justification for the specific quantization parameters (bit-widths, number of outliers) chosen for each activation group (A, B, and C). This systematic approach adds a degree of rigor to their algorithm design.
Weaknesses
My primary concerns with this paper relate to the experimental methodology, the fairness of the baseline comparisons, and the resulting inflation of performance claims.
-
Misleading and Unfair Performance Baselines: The headline performance claims are built on weak and, at times, inappropriate baselines.
- Speedup Claims: The reported speedups of up to 8.44x (Figure 14b, page 12) are derived from comparing LightNobel to GPU execution with chunking. The authors themselves acknowledge that chunking "significantly increases GPU latency due to kernel overhead" (Section 8.2, page 12). This is a known trade-off, and using this intentionally slower configuration as the primary basis for speedup claims is misleading. The more direct comparison against GPUs running without chunking (Figure 14c, page 12) reveals a much more modest speedup of 1.19x-2.42x. This suggests that the accelerator's core computational advantage is far less significant than claimed.
- Peak Memory Claims: The most striking claim of a 120.05x peak memory reduction is fundamentally flawed. This figure is obtained by comparing LightNobel to the theoretical memory requirement of a GPU executing without chunking on a protein so large that the GPU would immediately run out of memory (Figure 15, page 12). This is not a practical or meaningful comparison. The relevant comparison is against a functional GPU baseline using chunking, where the authors' own data shows a far smaller (though still valuable) reduction of 1.26x–5.05x. The 120.05x number is an artifact of a strawman argument.
-
Significant Confounders in Hardware Evaluation: The hardware efficiency claims are confounded by fundamental differences in technology that are not adequately addressed.
- Process Node Disparity: The proposed accelerator is evaluated using synthesis on a 28nm process, while the baseline NVIDIA A100 and H100 GPUs are manufactured on 7nm and 4nm processes, respectively (Section 8.4, page 13). Comparing a specialized ASIC on a dated node to a general-purpose GPU on a cutting-edge node makes it impossible to isolate the architectural contribution of LightNobel from the well-understood intrinsic efficiency advantage of ASICs and the process technology gap. The claims of 37-43x higher power efficiency are therefore not rigorously supported as an architectural innovation. A normalized comparison (e.g., using established scaling models) is required for a valid claim.
- Simulation vs. Silicon: The entire system performance is evaluated using a Python-based cycle-accurate simulator, not a physical prototype or even a full-system FPGA emulation (Section 6, page 10). The reported discrepancy of up to 4.63% between the simulator and isolated RTL modules raises concerns about the fidelity of the end-to-end performance results, which depend on complex interactions that may not be perfectly modeled.
-
Insufficient Analysis of Accuracy Degradation: The claim of "negligible accuracy loss" is based solely on a change in average TM-Score of less than 0.001 (Figure 13, page 11). In biological and pharmaceutical applications, worst-case performance is often more critical than average performance. An averaging approach can easily mask catastrophic failures on specific protein structures or families. The paper provides no analysis of the distribution, variance, or worst-case degradation of the TM-Score. Without this, the claim that accuracy is preserved remains unsubstantiated for practical use cases.
Questions to Address In Rebuttal
-
Please justify using the slower, chunked GPU implementation as the primary baseline for the headline speedup claims. Given the significant discrepancy, a revised discussion focusing on the more direct comparison against the non-chunked GPU baseline (where possible) is necessary to fairly represent the accelerator's computational advantage.
-
Regarding the 120.05x peak memory reduction claim: Please clarify that this number is derived from a comparison against a non-functional baseline (OOM GPU). Please give more prominence to the 1.26x-5.05x reduction against the functional, chunked GPU baseline, as this represents the practical improvement.
-
How would the reported power and area efficiency claims change if the LightNobel design were normalized to the same 4nm process technology as the H100 GPU? Without this normalization, how can the authors decouple their architectural contributions from the known, generic benefits of ASIC implementation?
-
Beyond average TM-Score, please provide data on the distribution of accuracy changes. Specifically, what is the standard deviation of the TM-Score delta, and what is the single worst-case accuracy degradation observed across all tested proteins in the CAMEO, CASP14, and CASP15 datasets?
-
The proposed AAQ method combines three distinct techniques: token-wise grouping, dynamic outlier handling, and multi-precision inliers. Can you provide an ablation study that isolates the impact of each of these components on both final accuracy (TM-Score) and memory footprint? This is essential for understanding which parts of your proposal contribute most to the results.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents LightNobel, a hardware-software co-designed solution aimed at overcoming the severe scalability limitations of Protein Structure Prediction Models (PPMs) like AlphaFold2 and ESMFold when processing proteins with long amino acid sequences. The authors identify the quadratic growth of activation memory with sequence length, specifically within the "Pair Representation" dataflow, as the primary bottleneck.
Their core contribution is a two-pronged solution. On the software side, they propose "Token-wise Adaptive Activation Quantization" (AAQ), a novel quantization scheme tailored to the unique data characteristics of PPM activations. Instead of a one-size-fits-all approach, AAQ analyzes activation patterns (e.g., distograms) to dynamically apply multi-precision formats (INT4/INT8/INT16) and handle outliers on a per-token basis. On the hardware side, they design the LightNobel accelerator, featuring a Reconfigurable Matrix Processing Unit (RMPU) and a Versatile Vector Processing Unit (VVPU), specifically engineered to efficiently execute the dynamic and multi-precision computations required by AAQ. The authors report significant improvements over high-end GPUs, including up to an 8.4x speedup and, critically, a reduction in peak memory requirements of up to 120x, enabling the analysis of much longer protein sequences on a single device.
Strengths
-
Tackling a Critical, High-Impact Problem: The paper addresses a well-known and significant barrier in computational biology. The success of AlphaFold2 has opened the door to structural proteomics, but its computational and memory demands for large proteins and multi-protein complexes are a major hindrance. This work is not an abstract optimization; it is a direct attempt to solve a real-world problem at the frontier of science.
-
Excellent Problem Characterization and Insight: The motivation section (Section 3, page 3-4) is particularly strong. The authors correctly identify that for PPMs, unlike for many large language models, the bottleneck is the explosion in activation size, not weight size (Figure 4). This insight justifies their entire approach. Furthermore, their identification of the "Token-Wise Distogram Pattern" (Section 3.3, page 4) is a key observation that provides a principled basis for their quantization scheme. This demonstrates a deep understanding of the target workload.
-
A Genuine Hardware-Software Co-Design: This is a prime example of effective co-design. The software algorithm (AAQ) presents challenges for general-purpose hardware due to its dynamic nature, multi-precision requirements, and per-token operations. The LightNobel architecture is explicitly designed to meet these challenges. The RMPU's reconfigurability to handle mixed-precision dot products and the VVPU's support for dynamic top-k outlier selection are direct hardware responses to the needs of the AAQ algorithm. This tight coupling is the paper's greatest strength.
-
Significant Potential for Enabling New Science: The most compelling result is the dramatic reduction in peak memory (Figure 15, page 12). By enabling the processing of sequences up to ~10,000 amino acids, LightNobel could make the routine structural analysis of very large proteins (like titin) or complex biological assemblies computationally tractable on a single accelerator. This is a capability leap, not just an incremental speedup, and could directly enable new biological discoveries that are currently out of reach.
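To put that capability leap in perspective, a back-of-the-envelope estimate (my own arithmetic, assuming FP32 storage and the Hz = 128 hidden dimension quoted later in these reviews) shows why a ~10,000-residue pair representation is otherwise prohibitive:

$$ N_s^{2} \times H_z \times 4\,\text{B} = (10^{4})^{2} \times 128 \times 4\,\text{B} \approx 5.1 \times 10^{10}\,\text{B} \approx 51\,\text{GB}. $$

That is a single copy of one activation tensor, before intermediate buffers or additional copies, so it is already comparable to or beyond the HBM capacity of a single A100/H100-class GPU.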
Weaknesses
-
Limited Contextualization Against Non-Quantization Scaling Methods: The paper positions itself well against other quantization work but could strengthen its argument by discussing alternative scaling strategies more thoroughly. Methods like memory-efficient attention mechanisms (as mentioned with LMA), sparsity, or low-rank factorization are also being explored to reduce the quadratic complexity of attention. A more detailed comparison, even a conceptual one, would help readers understand where AAQ and LightNobel fit within the broader landscape of solutions for scaling attention-based models.
-
Generalizability of the Core Assumptions: The AAQ method is built on the observed characteristics of activations in Evoformer-style backbones (ESMFold, AlphaFold2). However, the field is evolving rapidly; for example, AlphaFold3 incorporates a diffusion-based structure module. How robust are the core assumptions of AAQ to fundamentally different model architectures? A discussion on the potential generalizability of their insights (or the limits thereof) would significantly increase the long-term relevance of the work.
-
Hardware Complexity vs. Simpler Alternatives: The proposed RMPU is highly flexible and reconfigurable. This complexity is justified by the need to support the AAQ algorithm. However, it would be insightful to understand the trade-offs more clearly. For instance, what would be the performance and accuracy degradation if a simpler, non-adaptive token-wise quantization scheme (e.g., static INT8 for all tokens) were run on the same hardware? This would help isolate the specific benefits derived from the "adaptive" aspect of AAQ and the associated hardware complexity it necessitates.
Questions to Address In Rebuttal
-
Could the authors elaborate on how their approach compares to algorithmic scaling methods beyond chunking? For example, how does the efficiency gain from quantization compare to potential gains from methods that fundamentally alter the computation, such as Linformer or other sparse/efficient attention variants, if they were applied to PPMs?
-
The core insight of AAQ relies on the specific characteristics of the Pair Representation in models like ESMFold. With the recent release of AlphaFold3, which uses a different architectural paradigm, have the authors considered if their token-wise adaptive approach would still be applicable or what modifications might be needed?
-
To better understand the contribution of the adaptivity in AAQ, could you provide an ablation study? Specifically, what is the accuracy and performance (or memory footprint) if a non-adaptive but still token-wise scheme (e.g., uniform INT8 or INT4 for all tokens in a given group from Section 4.2) is implemented? This would help quantify the value added by the dynamic precision and outlier handling.
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents LightNobel, a hardware-software co-designed accelerator aimed at mitigating the sequence length limitation in Protein Structure Prediction Models (PPMs). The core problem, the quadratic growth of activation memory with sequence length due to the (Ns, Ns, Hz) Pair Representation tensor, is well-established. The authors propose a software solution, Token-wise Adaptive Activation Quantization (AAQ), and a tightly coupled hardware architecture featuring a Reconfigurable Matrix Processing Unit (RMPU) and a Versatile Vector Processing Unit (VVPU) to execute it efficiently.
The central novel claim is the formulation and co-design of this specific AAQ scheme. AAQ is predicated on the insight that PPM activations exhibit high variance between tokens but low variance between channels within a token (Section 3.3, page 4), justifying a token-wise quantization strategy. The "adaptive" nature of AAQ involves: 1) classifying activations into three groups with different precision and outlier handling requirements, and 2) using a dynamic, runtime top-k algorithm to identify a variable number of outliers for each token. The hardware's claimed novelty lies in its specific features designed to support this complex scheme, such as the RMPU's ability to handle multi-precision data without full dequantization and the VVPU's native support for runtime quantization and top-k selection.
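For readers less familiar with the mechanism, the sketch below shows the shape of a token-wise quantizer with runtime top-k outlier extraction: the largest-magnitude entries of a token stay at full precision, while the remaining inliers share a single per-token low-precision scale. The group thresholds, bit-widths, outlier count, and names are illustrative placeholders, not the paper's AAQ parameters.

```python
import numpy as np

def quantize_token(x, inlier_bits=4, num_outliers=4):
    """Token-wise quantization with runtime top-k outlier handling.

    `x` is one token's activation vector (length Hz). The `num_outliers`
    largest-magnitude entries are kept at full precision; the remaining
    inliers share one per-token scale at `inlier_bits` precision."""
    idx = np.argsort(np.abs(x))[-num_outliers:]          # runtime top-k selection
    outlier_vals = x[idx].copy()
    inliers = x.copy()
    inliers[idx] = 0.0
    qmax = 2 ** (inlier_bits - 1) - 1
    scale = np.max(np.abs(inliers)) / qmax if np.any(inliers) else 1.0
    # 4-bit codes are held in an int8 container here; a real design would pack them.
    q = np.clip(np.round(inliers / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale, idx, outlier_vals

def dequantize_token(q, scale, idx, outlier_vals):
    x_hat = q.astype(np.float32) * scale
    x_hat[idx] = outlier_vals                             # splice outliers back in
    return x_hat

rng = np.random.default_rng(0)
token = rng.standard_normal(128).astype(np.float32)       # Hz = 128, as cited in the reviews
token[[3, 40]] *= 25.0                                     # inject token-specific outliers
q, s, idx, outs = quantize_token(token)
err = np.abs(dequantize_token(q, s, idx, outs) - token).max()
print(f"max abs reconstruction error: {err:.4f}")
```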
Strengths
-
Novel Problem-Specific Insight: The key insight that justifies the entire approach—the analysis of token-wise versus channel-wise variance in PPM activations (Figure 5, page 4)—is a strong and novel contribution. While attention models are not new, this specific characterization of the Pair Representation data structure in the PPM domain appears to be original and provides a solid foundation for the proposed algorithmic changes.
-
Synthesis of Quantization Techniques: The AAQ algorithm is not a single new idea but a novel synthesis of several advanced quantization concepts. While token-wise quantization, outlier handling, and mixed precision have been explored individually, their combination here is unique. Specifically, the dynamic nature of both the precision (based on activation groups A, B, C in Section 4.2, page 6) and the outlier handling (runtime top-k) within a single, unified framework is a new formulation.
-
Tightly-Coupled Co-Design: The work presents a strong case for co-design. The proposed AAQ algorithm would be prohibitively inefficient on general-purpose hardware like GPUs. The hardware architecture, particularly the Dynamic Accumulation Logic (DAL) within the RMPU (Figure 9, page 8), is a non-trivial and novel design specifically tailored to handle the dynamic, multi-precision computations required by AAQ. This is not simply mapping an algorithm to an FPGA; it is a ground-up hardware design dictated by the algorithm's unique requirements. The integration of a top-k sorter into the VVPU for runtime outlier identification is another clear example of this co-design novelty.
Weaknesses
-
Novelty of Individual Primitives is Limited: While the synthesis is novel, the constituent parts of the solution have precedents in prior art. The authors should be more explicit about this.
- Token-wise Quantization: The concept itself is not new. LLM.int8() [17] identifies and handles outliers based on per-token statistics. The novelty here is not the invention of token-wise quantization, but its application and justification in the PPM domain, where the Ns >> Hz characteristic makes it uniquely suitable.
- Bit-level Reconfiguration: The RMPU's strategy of breaking data into bit-chunks and using reconfigurable units has conceptual overlap with prior work like Bit Fusion [57]. The paper's novelty is in the specific implementation (the DAL) for its unique multi-precision outlier/inlier accumulation task.
- Hardware Top-k Selection: Efficient hardware for sorting and top-k selection is a well-studied field [56]. The contribution is its integration into the quantization pipeline, which is made viable by the small hidden dimension (Hz=128) of PPMs.
-
Narrow Applicability: The core insight and resulting architecture are highly specialized to the unique
(Ns, Ns, Hz) data structure of PPMs. This raises questions about the broader novelty and impact of the architectural ideas. The design choices (e.g., token-wise parallelism, feasibility of runtime top-k) are justified by Ns being large while Hz is small. This makes the approach less applicable to mainstream LLMs where the opposite is true. The paper presents a point solution, albeit an innovative one.
Questions to Address In Rebuttal
-
Clarification of Novelty Delta: Please explicitly articulate the delta between AAQ's token-wise scheme and the per-token outlier handling in prior works like LLM.int8() [17]. Is the primary novelty the application to a new domain, the dynamic top-k mechanism, the adaptive precision, or a combination thereof?
-
Justification for Dynamic Complexity: The runtime top-k selection and the Dynamic Accumulation Logic (DAL) introduce significant control logic and hardware complexity. Have the authors explored a simpler, static approach (e.g., a fixed number of outliers per token, or a static mapping of precisions) and quantified the accuracy loss? This is crucial to justify the introduction of such a complex dynamic mechanism. Is the observed performance gain primarily from the token-wise approach itself, or is the added complexity of "adaptive" and "dynamic" features essential?
-
Generality of the Architecture: Given that the architecture's efficiency is predicated on the
Ns >> Hz property of PPMs, can the authors comment on the applicability of the RMPU/VVPU design to other model classes? Does any part of the hardware design represent a more general-purpose innovation for accelerating multi-precision workloads, or is its novelty inextricably tied to the PPM application?
InfiniMind: A Learning-Optimized Large-Scale Brain-Computer Interface
Abstract
Brain- computer interfaces (BCIs) provide an interactive closed-loop connection between the brain and a computer. By employing signal processors implanted within the brain, BCIs are driving innovations across various fields in neuroscience and medicine. ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present InfiniMind, a system designed to mitigate the performance and lifetime degradation caused by write-intensive continual learning algorithms on NVM-assisted, implantable Brain-Computer Interfaces (BCIs). The work identifies that learning workloads introduce significant NVM write overhead, which is a critical bottleneck. To address this, they propose a suite of four optimization techniques integrated into a memory controller: update filtering, delta buffering, out-of-place flushing, and waveform compression. These are implemented and evaluated within the architectural framework of a prior state-of-the-art system, SCALO. The authors claim substantial improvements in performance (5.39x) and NVM lifetime (23.52x) on representative BCI learning tasks.
While the problem is well-motivated and significant, the proposed solutions appear to be an application of well-established techniques from storage and memory systems to the BCI domain. The evaluation, while extensive, relies on artificially constructed datasets and a potentially weak baseline, which may inflate the reported benefits. The novelty of the core mechanisms is questionable, and several critical system-level effects, such as garbage collection overhead, are insufficiently addressed.
Strengths
- Problem Identification: The paper does an excellent job of identifying and motivating a critical, forward-looking problem. The analysis in Section 2, particularly the data presented in Figure 1, clearly illustrates the prohibitive cost of enabling continual learning on existing NVM-assisted BCI architectures. This is a timely and important contribution.
- Systematic Approach: The work is well-structured. The authors systematically derive four key observations about BCI workloads (Section 3) and map each observation directly to a proposed hardware solution (Section 4). This logical progression makes the paper easy to follow.
- End-to-End Integration: The authors implement their proposed techniques within a detailed, multi-accelerator system model (SCALO). This end-to-end approach is more compelling than a standalone memory controller simulation, as it accounts for interactions with processing elements (PEs) and the network-on-chip (NoC).
Weaknesses
-
Questionable Novelty of Core Techniques: The primary weakness of this work lies in the originality of the proposed solutions. Each of the four optimizations is a well-known concept:
- Update Filtering: Filtering small or redundant updates is a standard technique in fields ranging from sparse neural network training to signal processing. The contribution here is merely the application, not the invention, of a thresholding mechanism.
- Delta Buffering & Out-of-Place Flushing: This is fundamentally a combination of write buffering/caching and a log-structured file system (LFS) approach, as the authors themselves note in Section 4.3 by citing LFS. These are foundational concepts in storage systems. The custom mapping table (Section 4.2) seems more of an implementation detail necessitated by the BCI data structures rather than a fundamental contribution.
- Waveform Compression: The proposed algorithm (Section 4.4) appears to be a specific instantiation of run-length encoding combined with a simple unary-binary prefix code, applied after a lossy averaging step. The claim that existing algorithms are "inefficient" (Section 3.4) is not sufficiently substantiated, and it is unclear if the proposed custom scheme offers a fundamentally better trade-off than tuned versions of existing lossy signal compression algorithms.
-
Flawed Evaluation Methodology: The experimental validation raises several red flags that call the results into question:
- Artificial Dataset Scaling: Per Section 5.1, the authors "spatially concatenate" smaller datasets to create a synthetic 1,024-channel workload. This methodology is highly suspect. It fails to model the complex spatiotemporal correlations that would exist in a genuine large-scale recording. The observed locality (Key Observation #2) could be an artifact of this concatenation rather than a true property of large-scale neural signals. This fundamentally undermines the external validity of the results.
- Weak Baseline: The baseline system is SCALO with learning naively enabled, leading to massive write amplification and latency. This appears to be a strawman argument. A more rigorous baseline would incorporate a standard, albeit simple, write buffer, which is common practice in NVM controllers. The reported speedups are likely magnified by the weakness of this baseline.
- Parameter Tuning: The system contains numerous critical parameters (e.g., filtering thresholds, 72 KB buffer size, LFU policy). The paper presents sensitivity studies in Section 5.5, but these appear to be post-hoc justifications for pre-selected values rather than a principled exploration of the design space. For example, the filtering thresholds seem to be chosen specifically to keep the accuracy drop below a certain level (e.g., <0.55% in Section 3.1), which verges on circular reasoning.
-
Insufficient Analysis of System Overheads:
- Garbage Collection (GC): The out-of-place flushing mechanism will inevitably fragment the NVM and require garbage collection. The paper dedicates only a few sentences in Section 4.5.1 to a "greedy garbage collection strategy" and "dynamic wear leveling." This is a critical omission. GC can introduce significant tail latency and power overhead, which could easily negate some of the claimed benefits. Without a quantitative evaluation of GC costs, the performance claims are incomplete.
- Hyperbolic Claims: The title "InfiniMind" is an overstatement. The system extends lifetime, but it does not make it infinite. The results in Section 5.2 show lifetimes of 3.12 years for SS and less than a year for GRU, which still fall short of the 10-year target for clinical implants mentioned in Section 2.1. The claim of being the "first" to address this issue (Abstract) is also very strong and may overlook more specialized prior work.
Questions to Address In Rebuttal
- Please provide a stronger justification for the artificial dataset scaling methodology. How can you assure the reviewers that the observed signal properties (particularly temporal locality) are representative of true large-scale neural recordings and not artifacts of the concatenation process?
- Can you defend the choice of the baseline system? Please provide data comparing your system to a more robust baseline that includes a conventional page-level write buffer.
- The paper lacks a quantitative analysis of the garbage collection mechanism necessitated by the out-of-place flushing scheme. Please provide data on the performance (especially tail latency), power, and NVM write overheads introduced by your GC strategy.
- The accuracy preservation after aggressive filtering (e.g., filtering 95% of gradient updates for GRU with a <0.55% accuracy drop) is a remarkable claim. Please provide more direct evidence to support this, such as plots showing the learning convergence curves over time for the baseline versus the filtered models for all benchmarks.
- Please clarify the novelty of the waveform compression algorithm. Provide a head-to-head comparison against other modern lossy time-series compression algorithms (beyond the general LZ family) configured for the same small block sizes and accuracy targets.
Review 2
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents InfiniMind, a system architecture for large-scale, implantable Brain-Computer Interfaces (BCIs) that is specifically optimized for on-device continual learning. The authors correctly identify a critical emerging problem: as implantable BCIs scale up, they increasingly rely on Non-Volatile Memory (NVM) for its density and low static power, but the high write overhead (in terms of latency, power, and endurance) of NVM makes the write-intensive process of continual learning impractical.
The core contribution is a suite of four synergistic hardware optimizations—update filtering, delta buffering, out-of-place flushing, and waveform compression—that are deeply motivated by the specific statistical properties of neural signals and BCI learning workloads. By integrating these techniques into a state-of-the-art NVM-assisted BCI processor (SCALO), the authors demonstrate a holistic, end-to-end system that dramatically improves performance (5.39x) and extends device lifetime (23.52x), thereby making long-term, adaptive, implanted BCIs architecturally feasible.
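For concreteness, the write-reduction path formed by the first three mechanisms can be sketched in a few lines of Python. This is purely illustrative: the threshold value, the per-delta byte cost, and the NVMLog interface are placeholders of mine, not the authors' implementation; only the 72 KB buffer size is taken from the reviews.

```python
THRESHOLD = 1e-3               # assumed filtering threshold, not the paper's value
BUFFER_CAPACITY = 72 * 1024    # 72 KB delta buffer, as reported for InfiniMind

class NVMLog:
    """Stand-in for the NVM; real hardware appends to fresh flash pages (LFS-style)."""
    def __init__(self):
        self.segments = []

    def append_log_segment(self, merged_deltas):
        self.segments.append(dict(merged_deltas))

class DeltaBuffer:
    """Coalesces filtered parameter deltas before flushing them out-of-place."""
    def __init__(self, nvm, capacity=BUFFER_CAPACITY):
        self.nvm = nvm
        self.capacity = capacity
        self.deltas = {}          # param_id -> accumulated delta
        self.bytes_used = 0

    def apply_update(self, param_id, delta):
        if abs(delta) < THRESHOLD:        # update filtering: drop negligible updates
            return
        if param_id not in self.deltas:
            self.bytes_used += 8          # assume 8 bytes of buffer per tracked delta
        self.deltas[param_id] = self.deltas.get(param_id, 0.0) + delta
        if self.bytes_used >= self.capacity:
            self.flush()

    def flush(self):
        # Out-of-place flushing: append the merged deltas to a new log segment rather
        # than rewriting parameters in place, reducing write amplification on NVM.
        self.nvm.append_log_segment(self.deltas)
        self.deltas.clear()
        self.bytes_used = 0
```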
Strengths
-
Excellent Problem Formulation and Contextualization: The paper tackles a timely and critical problem at the confluence of several fields: neurotechnology, computer architecture, and machine learning. The authors do an excellent job motivating the need for both large-scale recording (necessitating NVM) and continual learning (to combat signal non-stationarity, as shown in Section 2.2, Page 3). This work is the logical and necessary next step for systems like SCALO [108], addressing the key bottleneck that prevents such architectures from supporting adaptation.
-
Strong Synthesis of Ideas from Disparate Fields: The true strength of this paper lies in its synthesis. The authors have effectively taken well-established principles from the world of storage systems and databases and reapplied them in the novel, highly-constrained context of an implantable medical device. For instance, the "delta buffering" and "out-of-place flushing" (Sections 4.2 and 4.3, Pages 6-8) are direct conceptual descendants of write buffering schemes and log-structured file systems (LFS) [95], which have been used for decades to mitigate the high cost of random writes in block-based storage. Applying these mature ideas to the unique data patterns of BCI workloads is a powerful contribution.
-
Deeply Motivated, Workload-Specific Optimizations: The proposed solutions are not generic; they are meticulously tailored to the domain. The analysis in Section 3 ("Key Observations & Design Goals," Page 5) is the foundation of the paper's strength. By identifying the recurrence and sparsity of neural signals to justify update filtering, the temporal locality to justify buffering, and the structural patterns to justify a custom compression algorithm, the authors build a compelling case that their architectural choices are grounded in the fundamental nature of the data. This elevates the work from a simple application of known techniques to a thoughtful, domain-specific co-design.
-
Holistic System-Level Approach: The authors did not stop at proposing a set of disjointed optimizations. They demonstrate a complete, end-to-end system by integrating their ideas into SCALO. This includes thoughtful modifications to the memory controller, the processing elements (PEs), and the Network-on-Chip (NoC), as detailed in Section 4.5 (Page 9). The design of a lightweight FTL and a dynamic NoC to handle variable-latency memory access shows a mature understanding of the system-level challenges involved. This completeness significantly increases the credibility and potential impact of the work.
Weaknesses
-
Understated Connection to Prior Art in Storage Systems: While I view the synthesis of ideas as a major strength, the paper could do more to explicitly situate its techniques within the broader history of write mitigation in storage and memory systems. The LFS inspiration is mentioned, but the decades of research on NVM write reduction, buffer management policies (e.g., CFLRU [88], BPLRU [63]), and data compression for flash memory [89] provide a rich context that is largely untapped. A more thorough discussion would not diminish the novelty but would instead highlight how the authors have skillfully adapted these general principles to the specific constraints (e.g., extreme power limits, real-time latency) of implantable BCIs.
-
Limited Exploration of Learning Timescales: The paper primarily focuses on the most demanding learning scenario: real-time, sample-by-sample recalibration. This is an important and challenging case. However, as the discussion in Section 6 (Page 13) briefly alludes to, continual learning in BCIs can occur on multiple timescales (e.g., minutes, hours, or daily). The current architecture seems optimized for the "fast and frequent" update pattern. It is less clear how the trade-offs would change for "slower, larger-batch" learning paradigms. For instance, would the delta buffer still be as effective, or would a different approach be needed? Exploring this aspect would broaden the applicability of the work.
-
Potential for Broader Impact Beyond BCIs: The optimizations are justified based on neural signals, but the underlying signal characteristics (sparsity, temporal locality, repeating waveforms) are common to many other biomedical time-series, such as electrocardiography (ECG) or electromyography (EMG). The paper could be strengthened by a brief discussion on the potential generalizability of the InfiniMind architecture to other long-term, implantable "smart" sensing and therapy devices that require on-device learning. This would help position the contribution not just as a solution for BCIs, but as a potential architectural template for a wider class of medical implants.
Questions to Address In Rebuttal
-
The core architectural ideas (buffering, out-of-place updates) have strong parallels in the storage systems community. Could the authors elaborate on what makes their implementation uniquely suited for the BCI domain, beyond the workload analysis? For example, how does the design of your application-specific mapping table and hierarchical structure (Section 4.2, Page 7) differ from and improve upon more traditional page- or block-level mapping in FTLs for this specific application?
-
The evaluation focuses on real-time learning. How would the InfiniMind architecture and its constituent optimizations perform under a different learning model, such as one where data is batched over several minutes and learning occurs in larger, more sporadic bursts? Would the 72 KB delta buffer be sufficient, or would the system's efficiency degrade?
-
The proposed optimizations incur a non-trivial area overhead of around 30% (Section 5.4, Page 12). While the lifetime and performance gains are significant, in the context of an implantable device, both area and power are critical. Could you provide more insight into this trade-off? Is this overhead the minimum required to achieve the reported benefits, or are there paths to a more area-efficient implementation, perhaps by sacrificing some performance or lifetime extension?
Review 3
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper presents "InfiniMind," a system architecture designed to enable continual learning on implantable, NVM-assisted Brain-Computer Interfaces (BCIs). The central problem addressed is the severe performance degradation and limited device lifetime caused by the high volume of write operations inherent in learning algorithms when mapped to NVMs like NAND Flash. The authors propose a suite of four hardware optimizations integrated into a memory controller: (1) Update Filtering to discard redundant parameter updates, (2) Delta Buffering to cache frequently modified data segments, (3) Out-of-Place Flushing to reduce write amplification, and (4) Waveform Compression to decrease the data volume written to NVM. The authors claim this is the first work to holistically address learning on NVM-assisted BCIs, and demonstrate significant improvements in performance (5.39x) and lifetime (23.52x).
The core novel claim is not the invention of fundamentally new architectural primitives, but rather the synthesis, adaptation, and integration of several established techniques from the fields of storage systems and databases into a cohesive solution for the unique problem domain of learning-enabled, large-scale BCIs.
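As a reference point for the novelty discussion, the compression pipeline as described here (lossy temporal averaging, run-length encoding, then a unary-binary prefix code) can be approximated in a few lines. The window size and the Rice-style code below are assumptions for illustration, not the paper's exact algorithm.

```python
def average_pairs(samples):
    """Lossy temporal aggregation: average each adjacent pair of samples (window of 2 assumed)."""
    return [(samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples) - 1, 2)]

def run_length_encode(values):
    """Collapse runs of identical values into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def unary_binary(n, k=4):
    """Toy unary-binary prefix code: unary-coded quotient plus k-bit binary remainder
    (a Rice/Golomb-style code; the paper's exact code may differ)."""
    q, r = n >> k, n & ((1 << k) - 1)
    return "1" * q + "0" + format(r, f"0{k}b")

samples = [12, 12, 13, 13, 13, 13, 12, 12]           # toy waveform segment
runs = run_length_encode(average_pairs(samples))      # -> [[12, 1], [13, 2], [12, 1]]
bits = "".join(unary_binary(v) + unary_binary(n) for v, n in runs)
```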
Strengths
-
Problem Identification: The paper correctly identifies a critical and forward-looking problem. While NVM-assisted BCI inference systems like SCALO exist, the architectural challenges of supporting write-intensive continual learning in this power- and area-constrained domain are novel and significant. The claim to be among the first to tackle this specific intersection appears credible.
-
Holistic System Integration: The strength of the work lies in its end-to-end system design. Rather than proposing a single, isolated optimization, the authors have engineered four complementary techniques that work in concert. The modifications to the memory controller, PEs, and NoC show a thoughtful approach to system-level integration.
-
Justified Complexity: The proposed architecture introduces non-trivial complexity (e.g., the hierarchical mapping table, compression units). However, the reported performance and lifetime gains are substantial, not marginal. This makes a compelling case that the added complexity is justified to make a previously impractical application (real-time learning on NVM-BCIs) viable.
Weaknesses
The primary weakness of this paper, when viewed through the lens of pure novelty, is that the foundational ideas behind each of the four proposed optimizations are well-established in prior art from other domains. The contribution is one of engineering and application, not fundamental invention.
-
Adaptation of Known Techniques:
- Out-of-Place Flushing: The authors explicitly state this is "motivated by the log-structured file system (LFS) [95]" (Section 4.3, Page 8). The concept of redirecting writes to new locations to avoid overwrites and reduce amplification is the core principle of LFS and modern SSD FTLs. This is a direct adaptation.
- Delta Buffering: Buffering only the modified "deltas" of data instead of full pages is a classic technique in database logging and storage caching. The paper's own related work section (Section 6, Page 14) acknowledges prior work in page-level [51, 88] and block-level [63] write buffer management.
- Update Filtering: The concept of filtering updates based on magnitude or significance is analogous to gradient sparsification or thresholding techniques used to reduce communication and computation in distributed machine learning and neuromorphic systems. The novelty is in applying it to BCI signal characteristics, but the principle is not new.
- Waveform Compression: The proposed algorithm is a pipeline of known primitives: temporal aggregation, run-length encoding, and unary-binary encoding (a form of variable-length coding). While tailored for BCI waveforms, it does not represent a novel compression algorithm in itself.
-
Overstated Novelty in Framing: The abstract's claim that "This work is the first to address the performance and lifetime issues of deploying learning on NVM-assisted BCI systems" is strong, but the novelty of the solutions themselves could be framed more precisely. The paper would be stronger if it explicitly positioned its contribution as "the first successful adaptation and synthesis of storage system techniques to solve the write-endurance problem for learning-enabled BCIs," rather than implying the techniques themselves are novel creations.
Questions to Address In Rebuttal
-
The core architectural primitives (delta buffering, out-of-place writes) are conceptually similar to techniques in storage systems. Please explicitly state what is fundamentally new in your implementation of these techniques that is necessitated by the BCI domain and would not be found in a conventional SSD FTL or database buffer manager. For example, how does the hierarchical mapping table (Section 4.2, Page 7) specifically handle BCI data structures (e.g., dynamic cluster management) in a way that a generic logical-to-physical mapping could not?
-
The authors claim this is the "first work to address the performance and lifetime issues..." (Abstract, Page 1). While the focus on BCIs is specific, the general problem of enabling learning on NVM-based, resource-constrained embedded systems is broader. Can the authors contrast their approach with any prior art in, for example, IoT or edge AI systems that also use NVM and must contend with write-intensive online learning tasks?
-
The novelty appears to lie in the co-design of the memory controller with the application's data structures and access patterns. Could the authors quantify how much of the benefit comes from this application-specific co-design versus what could be achieved by using a more generic, state-of-the-art FTL designed for small, random writes?
Adaptive CHERI Compartmentalization for Heterogeneous Accelerators
Abstract
Hardware accelerators offer high performance and energy efficiency for specific tasks compared to general-purpose processors. However, current hardware accelerator designs focus primarily on performance, overlooking security. This poses significant ...
Reviews
Review 1
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose "CapChecker," a hardware module intended to extend the CHERI capability security model to encompass CHERI-unaware hardware accelerators. The stated goal is to provide fine-grained, pointer-level memory protection for accelerator memory accesses without requiring architectural modifications to the accelerators themselves. The proposed system interposes on the accelerator's memory interface, checking DMA requests against a table of CHERI capabilities provided by the CPU. The authors evaluate this approach using a set of HLS-generated benchmarks on an FPGA platform, claiming an average performance overhead of 1.4% while providing significantly stronger security guarantees than traditional IOMMUs.
Strengths
- The paper identifies a critical and timely problem: the security gap created by integrating non-CHERI-aware accelerators into a CHERI-protected system.
- The comparison in Section 6.4 (Figure 12, page 12) effectively illustrates the scalability advantage of a pointer-based checking mechanism over a page-based IOMMU in terms of the number of required translation/protection entries.
Weaknesses
The paper's claims rest on a foundation of questionable assumptions and an evaluation that lacks critical details, casting significant doubt on the practicality and security of the proposed solution.
-
Unsupported "No Modification" Claim and Unrealistic Hardware Assumptions: The central premise of operating "without modifying accelerator architectures" (Abstract, page 1) is not upheld.
- The "Coarse" implementation (Figure 5, page 6) requires reserving the upper bits of the memory address to encode a pointer ID. This constitutes a modification of the accelerator's memory interface contract and reduces its addressable memory space. This is not "without modification."
- The "Fine" implementation, which provides the paper's strongest security claims (object-level protection), relies on an even more dubious assumption: that an accelerator's memory requests inherently carry a "pointer ID" that can be extracted to index the capability table. The paper suggests this could come from using separate hardware interfaces for each buffer (Section 5.2.2, page 7), which is a highly restrictive and unrealistic model for most real-world accelerators that use one or a few programmable DMA engines. The authors provide no evidence that the Vitis HLS toolflow used in their evaluation generates accelerators that meet this structural requirement.
-
Severely Limited Scope and Overstated Generality: The work is hamstrung by Assumption 2 (Section 4.1, page 4), which excludes any accelerator that performs dynamic memory management. This immediately rules out the most significant and complex classes of accelerators, such as GPUs and modern TPUs. Claiming this is a "general method" (Abstract, page 1) is therefore a significant overstatement. The problem is simplified to the point that its solution may not be relevant to the systems where such protection is most needed.
-
Incomplete and Ambiguous Evaluation: The experimental evaluation fails to substantiate the paper's security claims.
- Crucial Missing Data: The authors evaluate two distinct modes, Fine and Coarse, which offer vastly different security guarantees (object-level vs. task-level). However, the paper never states which mode was used for which benchmark in the performance evaluation (Section 6.3, pages 11-12). Given the likely architecture of HLS-generated accelerators, it is plausible that all benchmarks were evaluated using the far weaker Coarse mode. If this is the case, then all claims related to achieving fine-grained, object-level protection are entirely unsupported by the experimental results.
- Confusing Presentation: Figure 10 (page 11) is poorly explained. The terms ACPU and AACCEL are not defined in the caption or text, forcing the reviewer to guess their meaning. This is unacceptable for a quantitative results section.
- Temporal Safety is Ignored: The paper dismisses temporal memory safety issues (e.g., use-after-free) by stating it relies on a trusted software driver (Section 6.2, page 9). This is a major weakness. A comprehensive hardware security solution should not simply delegate an entire class of critical memory vulnerabilities to the assumption of perfect software.
-
Glossed-over Implementation Details: The design of the CapChecker itself raises unanswered questions. The paper mentions that the capability table may become full, requiring eviction and introducing a "potential for deadlock" (Section 5.2.3, page 7). This critical issue is mentioned and then immediately dismissed. What is the eviction policy? How is deadlock provably avoided? Without these details, the robustness of the hardware design is unknown.
Questions to Address In Rebuttal
- For the evaluation presented in Section 6, please state explicitly for each benchmark whether the Fine or Coarse implementation of CapChecker was used. If Fine mode was used for any benchmark, provide evidence from the generated hardware that the accelerator possesses the distinct hardware interfaces necessary to unambiguously supply a "pointer ID" for each memory buffer.
- How do you reconcile the claim of "without modifying accelerator architectures" with the Coarse mode's requirement to partition the address space, a clear modification of the accelerator's interface to the system?
- The threat model (Section 4.1, page 4) states the "driver cannot use accelerators to do anything it could not already do." Is this claim enforced by a hardware mechanism, or does it rely solely on the assumption of a bug-free, trusted software driver? If the latter, how does this provide a meaningful security improvement over a non-CHERI system that also relies on a trusted driver?
- Please provide a detailed description of the capability table eviction policy and a rigorous argument for how your design avoids deadlock when the table is full and multiple accelerator tasks are stalled pending allocation.
- Given that your solution explicitly excludes accelerators with dynamic memory management and defers all temporal safety to software, can you defend the assertion that this is a "general" and "efficient" solution for compartmentalization in modern heterogeneous systems?
Review 2
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents a novel and practical method for extending fine-grained, capability-based memory protection to hardware accelerators in heterogeneous systems. The core contribution is "CapChecker," a hardware module that acts as an adaptive interface between a CHERI-enabled CPU and CHERI-unaware accelerators. Without requiring any modification to the accelerator's internal architecture, CapChecker intercepts memory requests, validates them against CPU-provided capabilities at the pointer level, and thus enforces the CHERI security model on devices that are natively oblivious to it. This work proposes a general and scalable solution to unify memory safety across the entire SoC, addressing a significant and growing security gap in modern computing systems. The authors demonstrate their approach with a prototype system, showing an average performance overhead of only 1.4% across a diverse set of accelerator benchmarks.
Strengths
This paper's primary strength lies in its elegant synthesis of existing concepts to solve a pressing, real-world problem. It cleverly bridges the gap between the mature, CPU-centric CHERI security model and the often-insecure world of "black-box" hardware accelerators.
-
Addresses a Critical Problem with a General Solution: The proliferation of accelerators in systems ranging from embedded SoCs to data centers has created a significant attack surface. Existing solutions like IOMMUs offer coarse, page-level protection, while bespoke secure accelerators are not generalizable. This paper tackles the problem head-on by proposing a single, unified protection mechanism that is agnostic to the specific accelerator architecture. The "wrap, don't rewrite" philosophy is immensely practical and lowers the barrier to adoption.
-
Excellent Contextual Framing: The work is well-positioned within the landscape of hardware security. The authors clearly articulate the limitations of current approaches—the granularity issues of IOMMUs and the heterogeneity problem of mismatched protection schemes (as shown in Figure 1 on page 2). Their proposal directly addresses this architectural mismatch, which is a key insight.
-
Strong Architectural Concept: The CapChecker is a compelling architectural pattern. By acting as a security-aware interposer, it allows the trusted CPU to serve as the authority for memory permissions, effectively delegating and enforcing those permissions on behalf of untrusted or unaware accelerators. This is a powerful model for integrating legacy or third-party IP into a secure system.
-
Promising and Well-Executed Evaluation: The evaluation provides strong evidence for the viability of the approach. Using the standard MachSuite benchmark set on an FPGA platform demonstrates applicability to a diverse range of accelerator behaviors. The key result—a 1.4% average performance overhead—is excellent and suggests that this level of fine-grained security is affordable. The scalability analysis comparing the number of required CapChecker entries to IOMMU entries (Figure 12, page 12) is particularly insightful, highlighting the efficiency benefits of pointer-level granularity over page-level granularity for typical accelerator workloads.
Weaknesses
The paper's weaknesses are primarily related to simplifying assumptions that bound the scope of the work. While acceptable for a foundational paper, they are important to acknowledge for the broader context.
-
Exclusion of Complex Accelerators: The most significant limitation is Assumption 2 in the threat model (Section 4.1, page 4), which states that accelerators do not perform dynamic memory management. This is a reasonable starting point, as it covers many common accelerators. However, it explicitly excludes some of the most powerful and security-critical accelerators, such as GPUs, TPUs, and modern programmable NPUs, which have their own complex memory management schemes. The proposed model, where the CPU allocates all memory, would not directly apply.
-
Fragility of the "Coarse" Provenance Mechanism: The
Coarseimplementation, which encodes an object ID in the upper bits of the memory address (Section 5.2.2, page 7), is a pragmatic but potentially fragile solution for retrofitting provenance. As the authors themselves note, it does not protect against intra-task attacks where a buffer overflow could be used to craft a valid-looking address that points to a different object within the same task's allowed memory regions. This weakens the object-level protection guarantees in the worst-case scenario. -
Centralization of the Trusted Computing Base (TCB): The security of the entire system hinges on the correctness of the CapChecker hardware and its associated trusted software driver. While this is a standard system design trade-off, it concentrates risk. A vulnerability in the driver could be used to misconfigure the CapChecker, bypassing the protection for all accelerators.
Questions to Address In Rebuttal
The authors have presented a compelling vision for securing heterogeneous systems. I would appreciate their thoughts on the following points to better understand the future trajectory and robustness of this work.
-
Beyond the Current Threat Model: While the current work focuses on accelerators without dynamic memory management, this is a major frontier. Could the authors speculate on how the CapChecker concept might be evolved to support more complex accelerators like GPUs? Would this require a "co-design" where the GPU's memory manager interacts with the CapChecker via a standardized protocol, or could a purely hardware-based approach still be viable?
-
Robustness of Provenance: Regarding the Coarse mode, could you elaborate on the practicality of attacks that forge pointer IDs via overflows? How might this interact with software mitigations (e.g., placing guard regions between buffers) or potential hardware assists (e.g., a mode where the CapChecker enforces that accesses to different Pointer IDs are non-contiguous)?
Temporal Memory Safety: The current work focuses primarily on spatial memory safety. The trusted driver is responsible for managing allocation/deallocation, but this leaves open the possibility of use-after-free bugs if an accelerator continues to use a pointer after the driver has freed its memory. Could the CapChecker hardware be extended to aid in temporal safety, for instance, by supporting rapid revocation of capabilities in its table when memory is freed by the OS?
-
Performance at Scale: The evaluation shows excellent scalability with parallel tasks, but appears to use a single, shared CapChecker. In a future system with dozens of high-bandwidth accelerators, could this centralized checker become a performance or contention bottleneck? Have you considered a distributed or hierarchical CapChecker architecture to address this?
Review 3
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper proposes an architectural component, the "CapChecker," designed to extend the fine-grained memory protection of CHERI CPUs to unmodified, "black-box" hardware accelerators. The core idea is to interpose a hardware module between the accelerator and the memory system. This module maintains a repository of CHERI capabilities provided by the CPU and validates every DMA request from the accelerator against these capabilities at runtime. The authors claim this provides pointer-level memory safety for heterogeneous systems without requiring any modification to the accelerator's internal architecture, achieving this with a reported average performance overhead of 1.4%.
Strengths
The primary strength of this work lies in its novel realization of a previously conceptual idea. My analysis identifies the following novel contributions:
-
From Position Paper to Concrete Architecture: The most significant contribution is the translation of a high-level concept into a concrete, implemented, and evaluated system. Prior work, notably the position paper by Markettos et al. [47] (which the authors correctly cite), proposed the general idea of using CHERI capabilities to protect against malicious DMA. However, that work lacked a microarchitectural design, a hardware implementation, and quantitative analysis. This paper provides all three, presenting the specific design of the CapChecker with its capability table, the MMIO-based control path, and the logic for checking requests. This transition from idea to working prototype is a clear and significant novel contribution.
A "Black-Box" Philosophy for Capability Systems: The architectural approach is novel in its explicit goal of securing unmodified accelerators. This contrasts with other recent work in accelerator security, such as sNPU [20], which integrates capability-like mechanisms directly into the accelerator's design (a "white-box" approach). The
CapCheckerproposes a fundamentally different, non-invasive integration pattern. This is a novel point in the design space of secure heterogeneous systems, offering a path to retrofit security onto legacy or third-party IP cores. -
Demonstration of Practicality: The novelty is not just in the architecture itself, but in the demonstration that such an architecture is efficient. Achieving fine-grained, object-level protection for peripheral DMA with a performance overhead under 2% is a compelling result that establishes the viability of this new approach.
Weaknesses
My evaluation of novelty is not without reservations. The core idea, while implemented in a novel way, is built upon established concepts, and its claimed generality has notable limitations.
-
The Interposer Pattern is Not Novel: The fundamental concept of a hardware security monitor or interposer that sits between a peripheral and memory is not new. IOMMUs are the canonical example of this pattern, and other research has proposed various forms of DMA filtering or monitoring hardware. The novelty here is strictly limited to the type of policy being enforced (CHERI capabilities) rather than the architectural pattern of interposition itself.
-
The Coarse Mode Mechanism Lacks Novelty: The paper presents two implementations, Fine and Coarse (Figure 5, Page 6). The Coarse mode, proposed as a fallback when transaction provenance is unavailable, relies on reserving upper address bits to encode a pointer ID (a minimal sketch of this address-tagging scheme appears after this list). This technique—using parts of the address bus for metadata or tagging—is a well-established practice in computer architecture and is not a novel contribution. Its inclusion weakens the overall novelty claim, as it suggests the "general" solution reverts to a common trick in its most challenging case.
Scoped Novelty via Restrictive Assumptions: The paper's core novelty is scoped to a specific and simplified class of accelerators due to Assumption 2 (Section 4.1, Page 4), which excludes accelerators performing their own dynamic memory management. This is a critical limitation. The proposed CapChecker architecture, with its CPU-managed static capability table, is not suited for more complex accelerators like GPUs or TPUs that manage their own memory. Therefore, the novel contribution does not address the security of the most complex and widely used accelerators but is instead confined to a simpler domain. The claim of a "general method" is thus overstated.
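The address-tagging trick referenced above reduces to the following; the bit widths are assumptions chosen for illustration, not the paper's parameters.

```python
ADDR_BITS = 48      # assumed physical address width
ID_BITS = 8         # assumed number of upper address bits reserved for the pointer ID
ADDR_MASK = (1 << (ADDR_BITS - ID_BITS)) - 1

def tag_address(addr: int, pointer_id: int) -> int:
    """Driver/software side: fold the pointer ID into the reserved upper bits."""
    return (pointer_id << (ADDR_BITS - ID_BITS)) | (addr & ADDR_MASK)

def untag_address(tagged: int):
    """Checker side: recover (pointer_id, plain_address) from an incoming request."""
    return tagged >> (ADDR_BITS - ID_BITS), tagged & ADDR_MASK

tagged = tag_address(0x0000_4000_1000, pointer_id=5)
assert untag_address(tagged) == (5, 0x0000_4000_1000)
# Note the cost the review points out: reserving ID_BITS shrinks the accelerator's
# usable address space to 2**(ADDR_BITS - ID_BITS) bytes.
```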
Questions to Address In Rebuttal
The authors should use the rebuttal to clarify the precise boundaries of their novel contributions.
-
Regarding the Coarse implementation: Can the authors cite prior work that uses address-bit tagging for peripheral memory protection and clearly articulate what, if anything, is novel about their specific use of this technique in the context of the CapChecker?
The paper's central claim is a method that works "without modifying accelerator architectures." How would the CapChecker design need to change to support an accelerator that internally performs dynamic memory allocation (e.g., contains a malloc engine for its own scratchpad or local memory)? Would this require a fundamentally new architecture, thereby invalidating the claimed generality, or are there incremental novel extensions to your design that could support this?
The software driver model described in Section 5.3 (Page 8) appears to be a standard implementation for managing a new hardware device. Is there any conceptual novelty in the driver's capability management lifecycle (e.g., how capabilities are delegated, managed, and revoked for an external hardware unit) that distinguishes it from prior work on capability-based OSes managing system resources?
Unified Memory Protection with Multi-granular MAC and Integrity Tree for Heterogeneous Processors
Abstract
Recent system-on-a-chip (SoC) architectures for edge systems incorporate a variety of processing units, such as CPUs, GPUs, and NPUs. Although hardware-based memory protection is crucial for the security of edge systems, conventional mechanisms experience ...
Reviews
Review 1
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose a "unified" memory protection scheme for heterogeneous SoCs, termed "multi-granular MAC&tree." The mechanism aims to dynamically select an appropriate data granularity (from four levels: 64B, 512B, 4KB, 32KB) for both MACs and integrity tree counters to reduce security overheads. The core of the proposal is a hardware access tracker that monitors memory access patterns to determine the granularity, which is then stored in a "granularity table." The architecture modifies the integrity tree structure to prune nodes for coarse-grained regions and merges MACs to reduce storage fragmentation. The authors evaluate their design using a composite simulator built from ChampSim, MGPUSim, and mNPUsim, claiming a 21.1% execution time reduction over conventional fixed-granularity protection when combined with prior subtree techniques.
While the paper addresses an important and timely problem, its central claims of performance improvement are built on a methodologically questionable foundation. The proposed dynamic detection mechanism appears fragile and its key parameters are unjustified, while the simulation framework used for evaluation likely fails to capture the critical memory contention dynamics of a real heterogeneous system.
Strengths
- Problem Formulation: The paper correctly identifies a significant limitation in prior work: memory protection schemes are often device-specific, support only dual-granularity, or optimize either MACs or counters, but not both. The motivation for a unified, multi-granular approach is well-established in Section 3.
- Conceptual Co-design: The idea of coordinating the granularity of both the integrity tree and the MACs is sound. The proposed mechanisms for pruning tree nodes and merging MACs (Figure 10) are logical consequences of adopting a coarser granularity.
Weaknesses
-
Insufficient and Fragile Granularity Detection Mechanism: The entire dynamic scheme hinges on the access tracker described in Section 4.4, and this component is critically underspecified and likely insufficient.
- The authors state they use "12 access tracker entries" for the entire SoC, a number derived from an arbitrary-seeming heuristic of "3 × (# of processing units)." A complex heterogeneous system can have dozens or hundreds of active memory streams. With only 12 entries to track 32KB chunks, the tracker is likely to experience constant thrashing, leading to continuous, expensive granularity mispredictions and switches.
- The detection algorithm is purely reactive, triggering only upon the eviction of a tracker entry (e.g., after a 16K cycle lifetime). This means the system will operate with a suboptimal granularity for potentially long periods if an access pattern changes. The 16K cycle lifetime itself is a magic number with no justification or sensitivity analysis.
- The authors themselves admit that the "probability of misprediction is 26.5%" (Section 4.4). A system that is wrong more than a quarter of the time cannot be considered robust. The performance cost of these frequent mispredictions is not adequately analyzed, only hand-waved away by invoking "lazy granularity switching."
-
Questionable Simulation Fidelity: The evaluation methodology described in Section 5.1 is a significant point of concern that undermines the credibility of all reported results.
- The authors state their heterogeneous simulator was created by "adding memory requests of MGPUsim and ChampSim to mNPUsim and delaying the GPU warp computation or CPU operation." This describes a trace-based approach where pre-generated traces are interleaved. This method is notoriously poor at accurately modeling memory subsystem contention. In a real SoC, the CPU, GPU, and NPUs would be competing for memory bandwidth in real-time, creating complex queuing delays that a simple "delaying" mechanism cannot possibly capture. Since memory protection overhead is predominantly a function of memory traffic and latency, this methodological flaw calls the entire set of performance results into question.
- The paper claims to model an NVIDIA Orin system, a sophisticated modern SoC. A stitched-together simulator from three separate academic tools is highly unlikely to faithfully represent the Orin's complex interconnect, memory controllers, and system-level caches.
-
Underestimated Overheads: The overhead analysis in Section 4.5 and elsewhere appears incomplete.
- Every memory access must first determine its granularity. This requires consulting the granularity table. For a 4GB system, this table is 2MB and resides in off-chip memory. While the authors claim this has low overhead due to locality, they provide no data on the frequency of granularity table access or the actual latency and bandwidth impact of these accesses. This is a non-trivial, persistent overhead on the critical path of every memory operation.
- The cost of a 26.5% misprediction rate is not fully accounted for. Table 2 shows that switching granularity incurs additional fetches of MACs, data chunks, and/or integrity tree nodes. A system that is constantly switching back and forth due to tracker thrashing or pattern changes will incur significant performance penalties that are not clearly isolated in the results.
Questions to Address In Rebuttal
-
Regarding Simulation Fidelity: Please provide a detailed explanation of how your composite simulator models memory contention at the shared memory controller and interconnect. How does your "delaying" mechanism account for queuing theory effects and bandwidth saturation in a manner that is superior to simple trace interleaving? Can you provide any data validating your simulator's timing model against real hardware?
-
Regarding the Access Tracker: Justify the choice of only 12 access tracker entries. Please provide a sensitivity analysis showing how performance changes as the number of tracker entries is varied. What is the measured thrashing rate in the tracker for your evaluated workloads, and how does this correlate with the high 26.5% misprediction rate?
-
Regarding Overheads: Provide a quantitative analysis of the granularity table access overhead. Specifically, for your workloads, what is the rate of granularity table accesses (per 1k instructions or per second), and what is the resulting increase in memory traffic and average memory access latency?
-
Regarding Mispredictions: A 26.5% misprediction rate is exceptionally high. Please provide a breakdown of the performance penalty incurred by these mispredictions. How much of the "lazy switching" overhead detailed in Table 2 is realized in practice, and does this cost outweigh the benefits in scenarios with frequently changing access patterns?
Review 2
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper addresses the significant performance overhead of conventional, fixed-granularity memory protection schemes in modern heterogeneous Systems-on-a-Chip (SoCs). The authors correctly identify that the diverse memory access patterns of CPUs, GPUs, and NPUs make a single, fine-grained (e.g., 64B cache-line) approach inefficient.
The core contribution is a unified and dynamic multi-granular memory protection architecture, termed "multi-granular MAC&tree." This architecture moves beyond prior fragmented solutions by applying the principle of variable granularity to both Message Authentication Codes (MACs) for integrity and the integrity tree counters for freshness. The key technical innovation is a "granularity-aware integrity tree" that can prune its own branches when a memory region is promoted to a coarser granularity, thereby reducing both metadata traffic and tree traversal latency. The system dynamically detects appropriate granularities (from 64B up to 32KB) based on observed access patterns and demonstrates significant performance improvements, reducing execution time by up to 21.1% over a conventional baseline when combined with existing subtree optimization techniques.
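To see why pruning leaf counters shortens verification, a back-of-the-envelope sketch of tree depth versus leaf granularity is given below; the 4 GiB protected region and 8-ary tree are illustrative assumptions, not figures taken from the paper.

```python
import math

def tree_levels(protected_bytes: int, granule_bytes: int, arity: int = 8) -> int:
    """Levels of counter-tree nodes needed above leaves of the given granularity
    (illustrative only; real trees also account for counter packing per node)."""
    leaves = protected_bytes // granule_bytes
    return max(1, math.ceil(math.log(leaves, arity)))

for granule in (64, 512, 4096, 32 * 1024):
    print(f"{granule:>6} B leaves -> {tree_levels(4 << 30, granule)} tree levels")
# Coarser leaf granularity removes lower levels of the tree, which is the
# traversal-latency saving that leaf-node pruning targets.
```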
Strengths
The most compelling aspect of this work is its role as a synthesizer of previously disparate research threads. It successfully integrates and generalizes concepts from several areas of secure computer architecture.
-
Holistic Unification: The primary strength is the unification of multi-granularity support for both MACs and the integrity tree. Prior work, such as Yuan et al. [56] and Common Counters [35], largely focused on optimizing one or the other. By creating a single, coherent mechanism that manages both, the authors present a more complete and practical system-level solution. This addresses a clear gap in the literature and avoids the sub-optimal performance that arises from addressing only half the problem.
-
Generalization Beyond Dual-Granularity: The proposal supports four levels of granularity, moving beyond the dual-granularity limitations of many prior systems. As the authors demonstrate in their motivation (Section 3.1, Figure 4, Page 4), memory access patterns in heterogeneous workloads are diverse and not well-captured by a simple fine/coarse dichotomy. This multi-level approach makes the architecture more adaptable and future-proof as SoCs become even more heterogeneous.
-
Pragmatic Integration with Existing Techniques: The authors wisely do not propose to replace all prior optimizations. Instead, they demonstrate that their multi-granular tree is orthogonal to and can be combined with subtree-based optimizations like Bonsai Merkle Forests [17] (as discussed in Section 2.4, Page 4). This shows a mature understanding of the design space and positions their work as a valuable addition to the architect's security toolkit, rather than an all-or-nothing replacement.
-
Device-Independent Philosophy: By creating a unified mechanism, the paper provides an elegant alternative to the proliferation of device-specific security solutions, particularly the tree-less, tensor-based schemes for NPUs (e.g., TNPU [29]). While those solutions are effective in their narrow domain, they contribute to design complexity and hardware fragmentation at the SoC level. This paper's approach offers a scalable path forward for securing the entire shared memory space with a single, flexible engine.
Weaknesses
The weaknesses of the paper are primarily related to the inherent trade-offs of its dynamic and unified nature, which could benefit from a more detailed discussion.
-
Complexity of the Dynamic Mechanism: The proposed architecture introduces non-trivial hardware for the access tracker, granularity detector, and the lazy switching logic (Section 4.4, Page 7). While the authors provide area and power estimates (Section 4.5, Page 9), the complexity cost is more than just silicon. This dynamic hardware could introduce new, subtle side channels related to granularity changes. Furthermore, the performance of the system becomes dependent on the efficacy of a heuristic-based detection algorithm, which may have pathological cases not fully explored in the evaluation.
-
The Granularity Table as a New Bottleneck: The system relies on a "granularity table" to store the state of memory chunks. The authors note this table resides in protected memory (Section 4.4, Page 8), but it represents a new, centralized data structure that must be consulted on memory accesses. It would be beneficial to analyze the potential for this table to become a source of contention or an additional performance bottleneck, especially in highly parallel scenarios with many agents accessing different memory regions simultaneously.
-
Under-explored Comparison with Hybrid Approaches: The paper argues for a unified architecture over domain-specific ones. This is a strong argument from a design simplicity standpoint. However, it is conceivable that a "best-of-breed" hybrid system—one that uses a highly optimized tree-less scheme for NPU-private data and a multi-granular tree for shared CPU/GPU data—could outperform the proposed unified model. The paper would be strengthened by a more direct discussion of the trade-offs between its unified approach and such a sophisticated hybrid model. The cost of unification is not fully quantified against its alternatives.
Questions to Address In Rebuttal
-
Regarding the dynamic granularity management system (Section 4.4), could the authors elaborate on the robustness of their detection heuristic? Have they identified any workload characteristics or access patterns where the dynamic detection performs poorly, leading to excessive and costly granularity switching?
-
The granularity table is a critical component. Can the authors provide data or analysis on the performance impact of accessing this table? Specifically, what is the overhead in terms of latency and potential memory traffic for granularity table lookups, and how is contention managed if multiple processing units trigger granularity checks concurrently?
-
The central premise is that unification is superior to a collection of specialized solutions. Could the authors more directly articulate the advantages of their unified approach over a hypothetical hybrid system that combines existing, highly-optimized solutions (e.g., a tree-less mechanism for NPU workloads and a dual-granular MAC/tree for GPU/CPU workloads)? What are the quantitative benefits of unification in terms of shared hardware utilization or simplified software management that justify not using the most specialized solution for each processor type?
Review 3
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper proposes a unified, device-independent memory protection scheme for heterogeneous SoCs. The authors claim that prior work has optimized for coarse-grained access patterns by applying granularity optimizations to either Message Authentication Codes (MACs) or integrity tree counters, but not both simultaneously in a unified manner.
The core novelty claim of this work is the "multi-granular MAC&tree" architecture, which consists of two main ideas:
1. A "multi-granular integrity tree" that structurally adapts to coarse-grained memory regions. Instead of just using a shared counter, this mechanism prunes fine-grained leaf counter nodes from the tree and promotes their security responsibility to the parent node. This directly reduces the height and traversal cost of the integrity tree for those regions.
2. The unification of this novel tree mechanism with established techniques for coarse-grained MACs, all driven by a single dynamic granularity detection mechanism that supports four distinct granularity levels (64B, 512B, 4KB, 32KB).
The authors present a detailed architecture including an access tracker for dynamic detection, address computation for the reorganized MACs and counters, and a lazy switching mechanism to manage granularity changes.
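For concreteness, the eviction-time granularity decision could plausibly look like the sketch below. The density heuristic and its threshold are my own assumptions for illustration; the paper's detector may use a different rule.

```python
CHUNK_BYTES = 32 * 1024
LINE_BYTES = 64
GRANULARITIES = (64, 512, 4096, 32 * 1024)   # the four levels discussed in the paper
DENSITY_THRESHOLD = 0.75                     # assumed value, not taken from the paper

class AccessTrackerEntry:
    """Tracks which 64 B lines of one 32 KB chunk are touched before eviction."""
    def __init__(self, chunk_base: int):
        self.chunk_base = chunk_base
        self.touched = [False] * (CHUNK_BYTES // LINE_BYTES)   # 512-entry access vector

    def record(self, addr: int):
        self.touched[(addr - self.chunk_base) // LINE_BYTES] = True

    def decide_granularity(self) -> int:
        """On eviction, return the coarsest level whose touched sub-blocks are densely
        covered (one plausible heuristic; the paper's detector may differ)."""
        for g in reversed(GRANULARITIES):          # try 32 KB first, down to 64 B
            lines_per_granule = g // LINE_BYTES
            dense = True
            for start in range(0, len(self.touched), lines_per_granule):
                block = self.touched[start:start + lines_per_granule]
                if any(block) and sum(block) / len(block) < DENSITY_THRESHOLD:
                    dense = False
                    break
            if dense:
                return g
        return LINE_BYTES                          # fall back to cache-line granularity
```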
Strengths
From a novelty perspective, the paper's primary strength is the proposal of a structurally adaptive integrity tree.
-
Novel Integrity Tree Mechanism: The concept of dynamically pruning leaf nodes and promoting their function to a parent node (as depicted in Figure 10, page 6) is a genuinely new mechanism in the context of hardware-managed integrity trees for memory protection. While related to prior goals of reducing tree traversal, the method is distinct. Prior works like Bonsai Merkle Forest [17] or PENGLAI [16] focus on creating "subtrees" and caching their roots on-chip. This paper, in contrast, modifies the logical structure of the main tree itself, which is a more fundamental architectural change. This "node promotion" is the paper's most significant novel idea.
-
True Unification: The authors correctly identify a gap in the literature: the disjointed optimization of counters and MACs. While coarse-grained MACs [56] and coarse-grained counters [35] are known concepts, this paper is the first to propose a single, coherent architecture that modifies both the MAC storage layout and the integrity tree structure using a single, unified detection and management framework. This integration represents a novel system design.
Weaknesses
While the core unification and tree adaptation ideas are novel, several of the supporting components are direct, and sometimes incremental, extensions of prior art.
-
Recycled Detection Mechanism: The dynamic detection mechanism described in Section 4.4 (page 7) is not new. The use of an access tracker with a one-hot vector (or bitmask) to monitor accesses within a larger memory chunk and identify contiguous streaming patterns is functionally identical to the mechanism proposed and used by Yuan et al. [56] for dual-granular MACs. The extension from two granularity levels to four is an incremental improvement, not a fundamental one.
-
- Coarse-Grained MACs are Well-Established: The idea of merging multiple fine-grained MACs into a single coarse-grained MAC via nested hashing (Equation 5, page 7) is the standard and logical way to implement coarse-grained MACs. This concept was previously established in work on dual-granular MACs [56]. The paper applies this known technique, extending it to more levels, but does not invent it. (A sketch of this standard nested construction follows below.)
The central novelty, therefore, rests almost exclusively on the integrity tree modification and its co-design with the MAC scheme, rather than on the individual components that enable it. The contribution is one of novel architectural integration and a new tree optimization, not a collection of individually novel algorithms.
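For completeness, the nested construction in question is the standard one sketched here, using truncated HMAC-SHA256 as a stand-in primitive (an assumption; the paper's MAC function may differ) over eight 64 B lines of a 512 B block.

```python
import hashlib
import hmac

def mac(key: bytes, data: bytes) -> bytes:
    """Stand-in MAC (truncated HMAC-SHA256); the paper's primitive may differ."""
    return hmac.new(key, data, hashlib.sha256).digest()[:8]

def coarse_mac(key: bytes, cache_lines) -> bytes:
    """Nested construction: MAC the concatenation of the per-64B-line MACs,
    yielding one MAC for the whole coarse-grained block (illustrative only)."""
    fine_macs = b"".join(mac(key, line) for line in cache_lines)
    return mac(key, fine_macs)

key = b"\x00" * 16
lines = [bytes([i]) * 64 for i in range(8)]   # a 512 B block as eight 64 B lines
print(coarse_mac(key, lines).hex())
```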
Questions to Address In Rebuttal
-
Novelty vs. Subtree Optimizations: The paper combines its technique with subtree optimizations, but the distinction could be clearer. Could the authors further elaborate on the fundamental trade-offs between the proposed "node promotion" mechanism and prior "subtree root caching" [17]? For instance, does node promotion offer advantages in terms of metadata cache pressure or management complexity that make it inherently superior, or is it simply an alternative method to achieve the same end?
-
Complexity of Lazy Switching: The lazy switching mechanism to handle granularity mispredictions adds significant complexity, requiring storage for both "current" and "next" granularities in the granularity table (Section 4.4, page 8). What percentage of the performance gain is attributable to this lazy approach versus a simpler, more aggressive switching policy? Is this substantial complexity justified over a simpler design that may achieve, for example, 90% of the benefit?
-
Scalability of Granularity Levels: The paper proposes four fixed, power-of-8 levels of granularity (64B, 512B, 4KB, 32KB), which aligns neatly with an 8-arity tree. How does the proposed node promotion and address calculation logic scale if the system required more, or non-uniform, levels of granularity? What are the architectural barriers to supporting, for instance, eight levels of granularity or adapting to a 4-arity tree instead? This would help clarify the generality of the proposed novel mechanism.
SpecASan: Mitigating Transient Execution Attacks Using Speculative Address Sanitization
Abstract
Transient execution attacks (TEAs), such as Spectre and Meltdown, exploit speculative execution to leak sensitive data through residual microarchitectural state. Traditional defenses often incur high performance and hardware costs by delaying speculative ...
Reviews
Review 1
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose SpecASan, a hardware mechanism intended to mitigate transient execution attacks (TEAs) by extending the memory safety guarantees of ARM's Memory Tagging Extension (MTE) to the speculative execution path. The core idea is to treat many TEAs as "speculative memory safety violations." When a speculative memory access results in an MTE tag mismatch, SpecASan delays the instruction and its dependents until speculation resolves. The authors claim this approach effectively mitigates a broad class of attacks, including Spectre and MDS, with low performance overhead and minimal hardware complexity.
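For reference, the decision the summary describes reduces to roughly the following; the field names and 4-bit tag values are assumptions, and the actual microarchitecture additionally tracks dependent instructions and releases or squashes them when speculation resolves.

```python
from dataclasses import dataclass

@dataclass
class LoadRequest:
    pointer_tag: int      # tag bits carried in the pointer (MTE-style, 4 bits assumed)
    memory_tag: int       # tag stored for the target memory granule
    speculative: bool     # whether the load is under an unresolved prediction

def handle_load(req: LoadRequest) -> str:
    """Illustrative decision logic only; not the paper's full microarchitecture."""
    if req.pointer_tag == req.memory_tag:
        return "execute"             # tags match: safe on either path
    if req.speculative:
        return "defer"               # speculative mismatch: stall the load and its
                                     # dependents until the speculation resolves
    return "raise_tag_fault"         # architectural mismatch: normal MTE exception

assert handle_load(LoadRequest(0x3, 0x3, speculative=True)) == "execute"
assert handle_load(LoadRequest(0x3, 0x7, speculative=True)) == "defer"
assert handle_load(LoadRequest(0x3, 0x7, speculative=False)) == "raise_tag_fault"
```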
Strengths
- The fundamental premise—that enforcing memory safety during speculation can block the initial ACCESS stage of many TEAs—is logically sound for attacks that rely on classic memory safety violations (e.g., Spectre-v1 bounds check bypass).
- The performance evaluation on benign workloads (SPEC CPU2017, PARSEC) demonstrates a low overhead (Section 5.3, page 10), which, if the security claims hold, would be a notable result.
Weaknesses
My analysis reveals significant weaknesses in the paper's threat model, security evaluation, and methodological rigor. The claims of robust security and minimal complexity are not adequately substantiated.
-
Critical Flaw in the Threat Model: The security of the entire SpecASan mechanism is predicated on the integrity of MTE tags. However, the authors explicitly place attacks that leak the MTE tag itself as "out-of-scope" (Section 3.1, page 4). This is an unacceptable limitation. The referenced work [40] demonstrates that MTE tags can indeed be leaked via speculative execution. By excluding this, the authors sidestep the most critical attack against their own foundation. Relying on software conventions like "deterministic tag assignment" [33] as a remedy shifts the burden to software and undermines the claim of a robust hardware defense. If the tag can be leaked or guessed, an attacker can craft a pointer with a matching tag and bypass the entire defense (see the sketch after this list).
-
Unsubstantiated Security Claims against Contention Channels: The paper claims to mitigate Speculative Contention Channel (SCC) attacks like SMOTHERSpectre (Table 1, page 8). The justification provided is that "By enforcing strict speculative memory safety, SpecASan eliminates the root cause of speculative timing/contention leaks" (Section 4.1, page 8). This is a logical leap. SpecASan's delay mechanism prevents the result of an unsafe load from being used, but it does not prevent the memory access from creating contention in the memory subsystem before the tag check outcome is known. Furthermore, the delay mechanism itself introduces a new, predictable timing behavior (stall vs. no stall) that could be exploited as a side channel to learn whether a speculative access would have been safe or unsafe, leaking information about the memory layout or tag values. The paper provides no evidence to counter this possibility.
-
Insufficient Empirical Security Validation: The security evaluation (Section 4.3, page 9) is weak. The authors concede that an end-to-end attack is "infeasible in simulation," and instead rely on "monitoring detection logs for malicious speculative accesses." This methodology only confirms that the system detects the conditions it was designed to detect (i.e., tag mismatches). It does not prove the absence of information leakage. A sophisticated adversary does not need to complete an unsafe memory access if the attempt to access it leaks information. This evaluation method is insufficient for a security paper.
-
Questionable Validity of the MDS Mitigation Model: The mitigation for MDS attacks like ZombieLoad hinges on extending tag checks to the Line Fill Buffer (LFB). However, the authors state that they implemented a "simplified LFB model, inspired by the Intel processor's design" because the baseline ARM architecture "natively lacks an LFB" (Section 5.1, page 9). The security properties of microarchitectural buffers are notoriously complex and implementation-specific. Evaluating a defense against LFB vulnerabilities on a non-native, simplified model severely threatens the external validity of the findings. There is no assurance that this model captures the subtle forwarding paths that real-world MDS attacks exploit.
-
Incomplete Defense Requiring an External Mechanism: The authors admit that SpecASan alone only provides "partial mitigation" against control-flow attacks like Spectre-BTB and RSB (Section 4.2, page 8). The system remains vulnerable if an attacker redirects control flow to a gadget that operates on data with a valid tag. The proposed solution is to combine SpecASan with a separate mechanism, SpecCFI. While the combination may be effective, this demonstrates that SpecASan is not the "comprehensive" or "broad" solution the abstract claims it to be. The efficacy of the core proposal is limited.
-
Under-analyzed Performance and Complexity: The performance analysis exclusively uses benign benchmarks that are unlikely to trigger the delay mechanism frequently. The paper lacks a worst-case analysis using microbenchmarks designed to maximize the rate of unsafe speculative accesses. Without this, the true performance cost of the mechanism remains unknown. Similarly, the claim of "minimal hardware complexity" is not rigorously defended. The introduction of the Tag-Check Status Handler (TSH) and its broadcast mechanism to the ROB is non-trivial. The paper's assertion that this could be a "single-cycle operation" in a "small-scale ROB" (Section 3.4, page 6) is an optimistic assumption for modern high-performance cores with large, complex ROBs. The true latency and area cost are not properly quantified.
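Returning to the threat-model concern raised in the first weakness above: the sketch below illustrates, under standard MTE assumptions (a 4-bit logical tag held in pointer bits [59:56]), why a leaked or guessed tag defeats the very check SpecASan depends on. The addresses and helper names are hypothetical.

```python
TAG_SHIFT = 56                     # MTE keeps the 4-bit logical tag in pointer bits [59:56]
TAG_MASK = 0xF

def pointer_tag(ptr: int) -> int:
    return (ptr >> TAG_SHIFT) & TAG_MASK

def with_tag(ptr: int, tag: int) -> int:
    return (ptr & ~(TAG_MASK << TAG_SHIFT)) | ((tag & TAG_MASK) << TAG_SHIFT)

def tag_check_passes(ptr: int, memory_tag: int) -> bool:
    return pointer_tag(ptr) == memory_tag

victim_tag = 0x9                               # tag protecting the victim granule
victim_addr = 0x0000_7FFF_DEAD_0000            # hypothetical address of the victim data

# If the tag is leaked speculatively (or guessed, 1-in-16 odds), the attacker can
# forge a pointer that satisfies the tag check.
forged_ptr = with_tag(victim_addr, victim_tag)
print(tag_check_passes(forged_ptr, victim_tag))    # True -> the defense is bypassed
```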
Questions to Address In Rebuttal
The authors must address the following points directly in their rebuttal:
- How can SpecASan be considered a robust defense when its foundational element, the MTE tag, can be leaked speculatively—an attack vector you have explicitly defined as out-of-scope? Please justify this threat model decision.
- Provide a detailed, step-by-step explanation of how SpecASan prevents information leakage through contention-based channels (e.g., memory bus or MSHR contention). How do you prove that the delay mechanism itself does not create a new, exploitable timing side channel?
- Given that your security evaluation for MDS attacks relies on a simplified, non-native LFB model, what evidence can you provide that this model accurately captures the microarchitectural forwarding paths exploited by known MDS variants?
- Please provide performance data from a microbenchmark designed to maximize the frequency of unsafe speculative accesses. This is necessary to quantify the worst-case performance overhead of your delay mechanism.
- What is the projected cycle latency for the TSH-to-ROB dependency broadcast in a modern, wide-issue processor with a ROB size of over 200 entries? How does this latency impact the performance overhead calculations?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents SpecASan, a novel approach for mitigating a broad class of transient execution attacks (TEAs). The core contribution is not the invention of a new, complex hardware mechanism, but rather a profound and elegant reframing of the problem itself. Instead of treating TEAs as a side-channel leakage issue that must be cleaned up or contained, the authors model them as speculative violations of software-defined memory safety contracts.
To enforce these contracts, SpecASan leverages and extends an existing, commercially-deployed hardware feature—ARM's Memory Tagging Extension (MTE)—from the committed architectural path to the speculative microarchitectural path. When a speculative memory access fails an MTE tag check, it is selectively delayed until the speculation resolves. This prevents the initial unauthorized access that underpins most Spectre and MDS-style attacks, effectively cutting them off at the source. The authors demonstrate through simulation that this approach provides robust security with exceptionally low performance overhead (1.8-2.5%) and minimal additional hardware complexity beyond MTE itself.
Strengths
-
Conceptual Elegance and Reframing: The paper’s greatest strength lies in its intellectual contribution of shifting the problem space. The insight that most TEAs are fundamentally speculative permission violations—and can be stopped by enforcing permissions speculatively—is powerful. It moves the defense from the complex and often leaky TRANSMIT stage to the root of the problem at the ACCESS stage (as shown in Figure 1, page 2), but does so with a scalpel rather than a sledgehammer.
-
Pragmatism and A Clear Path to Adoption: This work stands out from a large body of academic proposals due to its practicality. By building upon ARM MTE, a security feature already present in consumer devices like the Google Pixel and Samsung Galaxy phones, the authors ground their work in reality. The marginal hardware cost to implement SpecASan on a processor that already supports MTE would be minimal, dramatically lowering the barrier to entry for industry adoption compared to proposals requiring entirely new shadow structures or complex dynamic information flow tracking (DIFT) engines.
-
Excellent Balance of Security and Performance: SpecASan strikes an impressive balance. It offers strong security against a wide range of attacks (Spectre-v1, -v4, MDS variants) by preventing the initial illegal read. Yet, its performance overhead is negligible because it only intervenes on the rare occasion of an unsafe speculative access. This contrasts sharply with heavyweight DIFT systems like STT [89] or coarse-grained fencing mechanisms, which incur significant performance penalties. The data in Figure 8 (page 11), showing the tiny fraction of instructions restricted by SpecASan compared to other methods, makes this point compellingly.
-
Synergy with the Broader Security Landscape: The paper correctly identifies that its memory-safety-based approach is a perfect complement to control-flow safety mechanisms. The discussion and evaluation of combining SpecASan with a CFI mechanism (SpecCFI) to provide more complete protection (Section 4.2, page 8 and Figure 9, page 11) shows a mature understanding of a defense-in-depth security architecture. This positions SpecASan not as a silver bullet, but as a crucial and highly efficient layer in a comprehensive security solution.
Weaknesses
While the core idea is excellent, its current instantiation and positioning have some limitations that are worth considering.
-
Security is Fundamentally Bounded by the Primitive: The security guarantees of SpecASan are inextricably linked to the guarantees of its underlying primitive, ARM MTE. As the authors rightly acknowledge in the discussion (Section 6, page 11), MTE has known limitations: a small 4-bit tag space leading to potential collisions, a coarse 16-byte granularity that can hide intra-granule buffer overflows, and recent research demonstrating speculative tag-leaking attacks [4, 40]. While the use of deterministic tagging can mitigate some of these issues, the paper's security claims are ultimately constrained by the strength (or weakness) of MTE itself. The contribution is the concept of speculative sanitization, but its current implementation inherits these weaknesses (the granularity and tag-space issues are illustrated in the sketch after this list).
-
Limited Scope of "Permission Model": The paper implicitly equates "software-defined security contracts" with the specific memory safety model enforced by MTE (spatial and temporal safety). This is a powerful model, but it is not exhaustive. A transient attack could potentially be constructed that does not violate MTE's memory safety rules but does violate a higher-level access control policy (e.g., accessing correctly tagged memory that is nonetheless off-limits based on the current process context). The conceptual framework is therefore more general than the specific MTE-based instantiation the paper evaluates.
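To illustrate the MTE limitations cited in the first weakness, the following sketch shows the 16-byte granularity and 4-bit tag-space issues; the addresses and field layout are hypothetical.

```python
GRANULE = 16                       # MTE assigns one tag per 16-byte granule

def same_granule(a: int, b: int) -> bool:
    return a // GRANULE == b // GRANULE

# Two hypothetical 8-byte fields packed into one 16-byte granule share a tag, so an
# overflow from the first field into the second never trips the tag check.
field_a = 0x2000                   # 8-byte buffer
field_b = 0x2008                   # adjacent 8-byte field in the same allocation
print(same_granule(field_a, field_b))          # True -> intra-granule overflow is invisible

# Across granules, the 4-bit tag space means a random out-of-bounds access still
# collides with the attacker's pointer tag with probability 1/16.
print(f"collision probability: {1/16:.4f}")
```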
Questions to Address In Rebuttal
-
The core idea of SpecASan—enforcing software-defined security contracts in the speculative domain—seems generalizable. How would the design and its security guarantees change if built upon a stronger hardware primitive with finer granularity and a larger key space, such as the capability model in CHERI? Would this allow you to mitigate a wider class of attacks or provide stronger proofs against tag collisions/leakage?
-
Could the authors elaborate on the precise boundary of attacks SpecASan can and cannot prevent? For instance, attacks that do not rely on an initial out-of-bounds memory read, such as Load Value Injection (LVI) where an attacker injects malicious data into microarchitectural buffers, seem to fall outside the direct protection of this memory-centric approach. A clearer delineation of the threat model would strengthen the paper.
-
There is a subtle but important shift in the usage model of MTE. MTE is primarily deployed for probabilistic bug detection in software. SpecASan repurposes it for deterministic, always-on security enforcement. Does this shift place a new, implicit burden on software developers? For example, for SpecASan to be fully effective, software must meticulously and correctly apply memory tags to all sensitive data regions using a robust (perhaps deterministic) tagging scheme. Could you comment on the software ecosystem changes required to fully realize the vision of SpecASan?
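As a concrete reference point for this question, the sketch below shows one simple deterministic tagging policy and the predictability trade-off it introduces. It is an assumption for illustration only, not a scheme taken from the paper or from [33, 67].

```python
NUM_TAGS = 16                      # 4-bit MTE tag space

def deterministic_tag(alloc_index: int) -> int:
    """One simple deterministic policy (illustrative only): derive the tag from
    allocation order so adjacent allocations always carry different tags."""
    return (alloc_index % (NUM_TAGS - 1)) + 1      # keep tag 0 for untagged memory

tags = [deterministic_tag(i) for i in range(6)]
print(tags)                                        # [1, 2, 3, 4, 5, 6]
print(all(a != b for a, b in zip(tags, tags[1:]))) # True: linear overflows always mismatch

# The trade-off: a fully deterministic pattern is also predictable, so software must
# decide where such tags are safe to use; this is the ecosystem burden the question raises.
print(deterministic_tag(3) == deterministic_tag(3 + NUM_TAGS - 1))   # True: tags repeat every 15 allocations
```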
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The central thesis of this paper is the application of hardware-enforced memory safety, specifically modeled after ARM's Memory Tagging Extension (MTE), to the speculative execution path to mitigate transient execution attacks (TEAs). The authors propose Speculative Address Sanitization (SpecASan), a mechanism that checks memory tags during speculative execution. If a speculative memory access violates its associated tag (indicating a potential memory safety violation like an out-of-bounds access), the access is delayed until the speculation resolves. This prevents the initial "ACCESS" stage of many TEAs, thereby stopping data from ever being read into microarchitectural buffers.
The core contribution is the shift from post-access information flow control (as seen in Speculative Taint Tracking) or coarse-grained access control (privilege-level checks) to a fine-grained, policy-based access control for speculative operations. This specific mechanism, while built on existing primitives (speculation, memory tagging), appears to be a novel and elegant synthesis for addressing a significant class of TEAs.
Strengths
-
Novelty of Mechanism: The primary strength of this paper is its novel approach. While the components—memory tagging (MTE) and delaying speculative instructions—are not new in themselves, their synthesis into a cohesive "speculative access control" system based on software-defined memory safety contracts is new. Prior art has focused on the following approaches (contrasted schematically in the sketch after this list):
- Tracking Information Flow: Speculative Taint Tracking (STT) [89] allows the speculative access to occur but taints the resulting data, preventing its use or transmission. SpecASan intervenes earlier and more fundamentally by preventing the data from being returned from memory/cache in the first place.
- Coarse-Grained Access Control: NDA [81] also targets the "ACCESS" stage but does so by delaying loads that cross privilege boundaries. SpecASan enforces a different and more granular policy (memory safety), allowing it to prevent attacks like Spectre-v1 that operate entirely within a single privilege domain.
- Heuristic-Based Fencing: Other techniques insert fences or use speculative load hardening, but the decision of what to delay is often based on heuristics (e.g., branch confidence) or static analysis. SpecASan makes a dynamic, hardware-enforced decision based on a concrete security property (the memory tag match), which is a more precise trigger.
-
Conceptual Elegance: The idea of extending a security property that is already enforced on the committed path (memory safety) to the speculative path is clean and intuitive. It frames many TEAs not as a new class of vulnerability, but as a failure to consistently enforce existing security contracts during transient execution.
-
Plausible Implementation Path: By grounding the design in an existing, industry-supported ISA extension (ARM MTE), the authors present a path to implementation that is more credible than a completely bespoke architecture. The proposed microarchitectural changes (Section 3.3, page 4), while non-trivial, are logical extensions of an MTE-enabled core.
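The sketch below contrasts the three decision points schematically, as characterized in this review; it is not drawn from the STT, NDA, or SpecASan implementations, and the field names are illustrative.

```python
def stt_policy(load, speculative):
    """STT-style: the access completes, but the value is tainted so it cannot
    reach a transmitting instruction while speculative."""
    return {"value": load["read"](), "tainted": speculative}

def nda_policy(load, speculative):
    """NDA-style: delay speculative loads that cross a privilege boundary."""
    if speculative and load["crosses_privilege"]:
        return {"delayed": True}
    return {"value": load["read"]()}

def specasan_policy(load, speculative):
    """SpecASan-style, as characterized in this review: delay speculative loads
    whose pointer tag mismatches the memory tag, even within one privilege domain."""
    if speculative and load["ptr_tag"] != load["mem_tag"]:
        return {"delayed": True}
    return {"value": load["read"]()}

# A Spectre-v1-style out-of-bounds read within user space: no privilege crossing,
# but the tags mismatch, so only the tag-based policy stops the access itself.
load = {"read": lambda: 0x41, "crosses_privilege": False, "ptr_tag": 0x3, "mem_tag": 0xA}
print(stt_policy(load, True))        # access happens; value merely tainted
print(nda_policy(load, True))        # not delayed: same privilege domain
print(specasan_policy(load, True))   # delayed: speculative memory-safety violation
```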
Weaknesses
-
Contingent Novelty: The novelty of the solution is contingent on the effectiveness of the underlying memory tagging scheme. The paper acknowledges this limitation in Section 6 ("Limitation of ARM MTE", page 11). The core idea is "enforce software contracts speculatively"; the paper's only instantiation is via MTE. However, MTE itself has a small tag space (4 bits), making it vulnerable to tag guessing and collisions, as demonstrated by prior work [4, 40]. If the tags can be bypassed, SpecASan's protection evaporates. The novelty of the paper is therefore tied to a primitive with known weaknesses.
-
Limited Scope of Novelty: The proposed mechanism is fundamentally tied to memory safety violations. It provides no protection against TEAs that do not manifest as an out-of-bounds or use-after-free access. For example, attacks that leverage valid-but-unintended accesses within the same memory object, or attacks on non-memory components, fall outside the scope. The novelty is potent but narrowly applied to a specific subset of the TEA problem space.
-
Incremental Microarchitectural Changes: While the system-level idea is novel, the microarchitectural changes themselves are largely additive extensions of MTE. They consist of adding tag storage and comparators to various buffers (LSQ, LFB, Caches). While this is a practical engineering approach, it means the novelty lies more in the "what" and "why" than in a groundbreaking "how" at the circuit or microarchitectural-structure level.
Questions to Address In Rebuttal
-
The paper compares itself to NDA [81], which also prevents attacks at the "ACCESS" stage. While your approach enforces a different policy (memory safety vs. privilege level), could you elaborate on a concrete attack scenario that SpecASan prevents but a system like NDA would miss, and vice versa? This would help further sharpen the "delta" of your contribution over the closest prior art that also targets the "ACCESS" stage.
-
The security of SpecASan is fundamentally predicated on the integrity of the underlying memory tagging scheme. The rebuttal should address the threat model of MTE tag leakage/guessing attacks [4, 32, 40]. How does the possibility of a speculative tag-guessing attack, which determines the correct tag to bypass SpecASan's check, affect the claim of providing "robust" protection against TEAs? Does relying on deterministic tagging [33, 67], as suggested, fundamentally alter the threat model or performance assumptions?
-
The core novel insight is "speculative contract enforcement," with memory safety being the specific contract chosen. Have the authors considered the generalizability of this framework? Could this model be extended to enforce other types of software-defined contracts during speculation (e.g., type safety, dataflow integrity properties) using different metadata schemes? Discussing this could strengthen the claim that the contribution is a new paradigm rather than a single point solution.