Category: Computer Architecture

  • Breaking Boundaries: The Quantum Revolution Accelerates with Google’s 105-Qubit Willow Chip and Beyond

    In the rapidly evolving field of quantum computing, recent advancements have propelled the technology closer to practical, real-world applications. A significant milestone was achieved with the unveiling of Google’s 105-qubit Willow chip, which demonstrated unprecedented computational capabilities and advancements in quantum error correction. This breakthrough is part of a broader trend of innovations by leading…

  • High-Performance Matrix Multiplication with FLAME/BLIS: A Deep Dive into DGEMM

    When it comes to scientific computing and machine learning, efficient matrix multiplication is a fundamental building block. Among the most critical operations in linear algebra libraries is DGEMM (Double-precision General Matrix Multiply), which computes the product of two double-precision matrices. In the quest for optimal performance, the BLAS (Basic Linear Algebra Subprograms) interface has been…

  • Navigating Memory Management in NUMA Architectures with Dual Memory Technologies

    Modern applications are increasingly complex, requiring not only higher compute power but also sophisticated memory solutions to achieve optimal performance. NUMA (Non-Uniform Memory Access) architectures are designed to tackle memory performance challenges in systems with multiple processors, allowing each processor to access its own local memory faster than it can access memory attached to other…

  • xAI Colossus: Musk’s HPC Powerhouse Set to Transform AI

    Elon Musk’s xAI initiative is powered by a supercomputing cluster called Colossus, designed for high-performance computing (HPC) on an unprecedented scale. The system features over 10,000 Nvidia H100 GPUs, making it one of the most powerful AI-focused clusters in the world. The integration of these GPUs…

  • ARM SVE2 explained

    Scalable Vector Extension 2 (SVE2) is an updated version of the Scalable Vector Extension (SVE), an instruction set introduced by ARM for its processors to improve performance in high-performance computing (HPC), artificial intelligence (AI), and machine learning (ML). SVE2 builds on SVE, offering several improvements aimed at enhancing performance and versatility, especially for general-purpose and…

  • VNET vs VC in gem5 Garnet NoC

    When first introduced to the Garnet NoC, one might find the terms VNET and VC confusing. A simple explanation follows. A VNET (Virtual Network) can be thought of as a separate physical channel that carries a specific type of message. More specifically, a VNET is directly tied to the cache coherence protocol in use.…

  • Empirical roofline tool (ERT) – a benchmark for machine performance characterization

    A well-known and very useful benchmark for characterizing a machine’s performance is the Empirical Roofline Tool (ERT). ERT automatically generates roofline data for a given computer. This includes the maximum bandwidth for the various levels of the memory hierarchy and the maximum GFLOP/s rate. This data is obtained using…

  • Benchmark Graviton3E vs Graviton3

    We benchmark Amazon’s recently released HPC platform, Graviton3E, the HPC-oriented version of Graviton3. According to Amazon, the new Hpc7g instances provide up to 35 percent higher vector instruction processing performance than the standard Graviton3. Additionally, Graviton3E provides two times better floating-point performance than Graviton2. All…

  • ARM SVE Explained

    ARM Scalable Vector Extension (SVE) is an innovative vector processing technology designed by ARM Holdings, primarily for their ARM-based processors. Here’s a concise explanation in 10 sentences:

  • Modeling ARM Cortex-A76 in gem5

    One can model ARM Cortex processors (e.g. Cortex-A53, Cortex-A76, Cortex-A77) in gem5 by extending the detailed Out of Order (O3) processor model. This can be done in two simple steps. First, one needs to adjust “gem5/src/cpu/o3/BaseO3CPU.py” in order to set the various CPU parameters, such as fetchWidth, decodeWidth, issueWidth, LQEntries, and SQEntries, according to the desired…

  • NUMA and why it matters

    Non-Uniform Memory Access (NUMA) is a computer architecture design that can significantly impact the performance and scalability of multi-processor systems. In summary, NUMA matters because it addresses memory access latency, improves system scalability, allows for workload optimization, ensures cache coherency, and contributes to energy efficiency in multi-processor systems,…

  • About memory compression

    Memory compression is a technique used to reduce the amount of memory that is being used by a computer system. It works by compressing the data that is stored in memory, which allows more data to be stored in the same amount of physical memory. The basic idea behind memory compression is to identify areas…

  • The problem with CPU frequencies

    The relative stagnation of CPU (Central Processing Unit) frequencies over the last decade is the result of several technological and physical limitations. Instead of focusing on increasing clock speeds, CPU manufacturers have adopted a more holistic approach to improving performance. While the gigahertz race that characterized CPU development in…