Blue Gene was an
IBM project aimed at designing supercomputers that could reach operating speeds in the
petaFLOPS (PFLOPS) range, with low power consumption.
The project created three generations of supercomputers, Blue Gene/L, Blue Gene/P, and Blue Gene/Q. During their deployment, Blue Gene systems often led the
TOP500[1] and
Green500[2] rankings of the most powerful and most power-efficient supercomputers, respectively. Blue Gene systems have also consistently scored top positions in the
Graph500 list.[3] The project was awarded the 2009
National Medal of Technology and Innovation.[4]
After Blue Gene/Q, IBM focused its supercomputer efforts on the
OpenPower platform, using accelerators such as
FPGAs and
GPUs to address the diminishing returns of
Moore's law.[5][6]
History
A video presentation of the history and technology of the Blue Gene project was given at the Supercomputing 2020 conference.[7]
In December 1999, IBM announced a US$100 million research initiative for a five-year effort to build a massively
parallel computer, to be applied to the study of biomolecular phenomena such as
protein folding.[8] The research and development was pursued by a large multi-disciplinary team at the
IBM T. J. Watson Research Center, initially led by
William R. Pulleyblank.[9]
The project had two main goals: to advance understanding of the mechanisms behind protein folding via large-scale simulation, and to explore novel ideas in massively parallel machine architecture and software. Major areas of investigation included: how to use this novel platform to effectively meet its scientific goals, how to make such massively parallel machines more usable, and how to achieve performance targets at a reasonable cost, through novel machine architectures.
The initial design for Blue Gene was based on an early version of the
Cyclops64 architecture, designed by
Monty Denneau. In parallel, Alan Gara had started working on an extension of the
QCDOC architecture into a more general-purpose supercomputer. The
US Department of Energy started funding the development of this system and it became known as Blue Gene/L (L for Light). Development of the original Blue Gene architecture continued under the name Blue Gene/C (C for Cyclops) and, later, Cyclops64.
In November 2004 a 16-
rack system, with each rack holding 1,024 compute nodes, achieved first place in the
TOP500 list, with a
LINPACK benchmark performance of 70.72 TFLOPS.[1] It thereby overtook NEC's
Earth Simulator, which had held the title of the fastest computer in the world since 2002. From 2004 through 2007 the Blue Gene/L installation at LLNL[10] gradually expanded to 104 racks, achieving 478 TFLOPS Linpack and 596 TFLOPS peak. The LLNL Blue Gene/L installation held the first position in the TOP500 list for 3.5 years, until in June 2008 it was overtaken by IBM's Cell-based
Roadrunner system at
Los Alamos National Laboratory, which was the first system to surpass the 1 PetaFLOPS mark.
While the LLNL installation was the largest Blue Gene/L installation, many smaller installations followed. The November 2006
TOP500 list showed 27 computers with the eServer Blue Gene Solution architecture. For example, three racks of Blue Gene/L were housed at the
San Diego Supercomputer Center.
While the
TOP500 measures performance on a single benchmark application, Linpack, Blue Gene/L also set records for performance on a wider set of applications. Blue Gene/L was the first supercomputer ever to run over 100
TFLOPS sustained on a real-world application, namely a three-dimensional molecular dynamics code (ddcMD), simulating solidification (nucleation and growth processes) of molten metal under high pressure and temperature conditions. This achievement won the 2005
Gordon Bell Prize.
In June 2006,
NNSA and IBM announced that Blue Gene/L achieved 207.3 TFLOPS on a quantum chemical application (
Qbox).[11] At Supercomputing 2006,[12] Blue Gene/L was awarded the winning prize in all HPC Challenge Classes of awards.[13] In 2007, a team from the
IBM Almaden Research Center and the
University of Nevada ran an
artificial neural network almost half as complex as the brain of a mouse for the equivalent of a second (the network was run at 1/10 of normal speed for 10 seconds).[14]
The name
The name Blue Gene comes from what it was originally designed to do: help biologists understand the processes of
protein folding and
gene development.[15] "Blue" is a traditional moniker that IBM uses for many of its products and
the company itself. The original Blue Gene design was renamed "Blue Gene/C" and eventually
Cyclops64. The "L" in Blue Gene/L comes from "Light" as that design's original name was "Blue Light". The "P" version was designed to be a
petascale design. "Q" is just the letter after "P".[16]
Major features
The Blue Gene/L supercomputer was unique in the following aspects:[17]
Trading the speed of processors for lower power consumption. Blue Gene/L used low frequency and low power embedded PowerPC cores with floating-point accelerators. While the performance of each chip was relatively low, the system could achieve better power efficiency for applications that could use large numbers of nodes.
Dual processors per node with two working modes: co-processor mode where one processor handles computation and the other handles communication; and virtual-node mode, where both processors are available to run user code, but the processors share both the computation and the communication load.
System-on-a-chip design. Components were embedded on a single chip for each node, with the exception of 512 MB external DRAM.
A large number of nodes (scalable in increments of 1024 up to at least 65,536).
Three-dimensional
torus interconnect with auxiliary networks for global communications (broadcast and reductions), I/O, and management.
Lightweight OS per node for minimum system overhead (system noise).
Architecture
The Blue Gene/L architecture was an evolution of the QCDSP and
QCDOC architectures. Each Blue Gene/L Compute or I/O node was a single
ASIC with associated
DRAM memory chips. The ASIC integrated two 700 MHz
PowerPC 440 embedded processors, each with a double-pipeline-double-precision
Floating-Point Unit (FPU), a
cache sub-system with built-in DRAM controller and the logic to support multiple communication sub-systems. The dual FPUs gave each Blue Gene/L node a theoretical peak performance of 5.6
GFLOPS (gigaFLOPS). The two CPUs were not
cache coherent with one another.
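The 5.6 GFLOPS figure can be reproduced with a short calculation. This is a sketch, not a statement from the design documents: it assumes each of the two FPU pipelines completes one fused multiply-add (counted as two operations) per cycle, which the text above does not state explicitly. The function name is illustrative.

```python
def peak_gflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak = cores x clock (GHz) x floating-point ops per cycle."""
    return cores * clock_ghz * flops_per_cycle


# Assumption: 2 FPU pipelines per core, each retiring one fused multiply-add
# (2 operations) per cycle, i.e. 4 operations per core per cycle.
FLOPS_PER_CYCLE_PER_CORE = 2 * 2

bgl_node_peak = peak_gflops(
    cores=2, clock_ghz=0.7, flops_per_cycle=FLOPS_PER_CYCLE_PER_CORE
)
print(f"Blue Gene/L node peak: {bgl_node_peak:.1f} GFLOPS")
```

Under that assumption, the two 700 MHz cores give the quoted per-node peak.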
Compute nodes were packaged two per compute card, with 16 compute cards (thus 32 nodes) plus up to 2 I/O nodes per node board. A cabinet/rack contained 32 node boards.[18] By the integration of all essential sub-systems on a single chip, and the use of low-power logic, each Compute or I/O node dissipated about 17 watts (including DRAMs). The low power per node allowed aggressive packaging of up to 1024 compute nodes, plus additional I/O nodes, in a standard
19-inch rack, within reasonable limits on electrical power supply and air cooling. The system performance metrics, in terms of
FLOPS per watt, FLOPS per m² of floorspace and FLOPS per unit cost, allowed scaling up to very high performance. With so many nodes, component failures were inevitable. The system was able to electrically isolate faulty components, down to a granularity of half a rack (512 compute nodes), to allow the machine to continue to run.
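The packaging figures above multiply out as follows. The sketch counts compute nodes only, so the rack power it prints is a rough lower bound that ignores I/O nodes, fans and power-conversion losses. The constant names are illustrative.

```python
NODES_PER_CARD = 2
CARDS_PER_BOARD = 16
BOARDS_PER_RACK = 32
WATTS_PER_NODE = 17.0      # approximate figure quoted above, DRAM included
NODE_PEAK_GFLOPS = 5.6     # per-node theoretical peak quoted above

nodes_per_board = NODES_PER_CARD * CARDS_PER_BOARD
nodes_per_rack = nodes_per_board * BOARDS_PER_RACK
isolation_unit_nodes = nodes_per_rack // 2   # half a rack

# Compute nodes only: an estimate, not a measured rack power.
rack_power_kw = nodes_per_rack * WATTS_PER_NODE / 1000.0
rack_peak_tflops = nodes_per_rack * NODE_PEAK_GFLOPS / 1000.0


def system_peak_tflops(racks: int) -> float:
    """Theoretical peak of a system built from whole racks."""
    return racks * rack_peak_tflops


print(f"{nodes_per_board} nodes per board, {nodes_per_rack} per rack")
print(f"about {rack_power_kw:.1f} kW per rack for compute nodes")
print(f"104 racks: {system_peak_tflops(104):.0f} TFLOPS peak")
```

The 104-rack result agrees with the 596 TFLOPS peak quoted for the fully expanded LLNL installation.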
Each Blue Gene/L node was attached to three parallel communications networks: a
3D toroidal network for peer-to-peer communication between compute nodes, a collective network for collective communication (broadcasts and reduce operations), and a global interrupt network for
fast barriers. The I/O nodes, which ran the
Linux operating system, provided communication to storage and external hosts via an
Ethernet network. The I/O nodes handled filesystem operations on behalf of the compute nodes. A separate and private
Ethernet management network provided access to any node for configuration,
booting and diagnostics.
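In a 3D torus, every node has six nearest neighbours, and coordinates wrap around at the edges. The sketch below illustrates that addressing rule only. The 8×8×8 dimensions and the function name are hypothetical choices for the example, not a description of the actual routing hardware.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]


def torus_neighbors(node: Coord, dims: Coord) -> List[Coord]:
    """Return the six nearest neighbours of `node` in a 3D torus of size `dims`."""
    neighbors: List[Coord] = []
    for axis in range(3):
        for step in (-1, +1):
            coord = list(node)
            # The modulo implements the wrap-around link of the torus.
            coord[axis] = (coord[axis] + step) % dims[axis]
            neighbors.append((coord[0], coord[1], coord[2]))
    return neighbors


dims = (8, 8, 8)  # hypothetical torus size, chosen only for illustration
print(torus_neighbors((0, 0, 7), dims))
```

A node on the "edge" of the coordinate space, such as (0, 0, 7), still has six neighbours because of the wrap-around links.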
To allow multiple programs to run concurrently, a Blue Gene/L system could be partitioned into electronically isolated sets of nodes. The number of nodes in a partition had to be a positive
integer power of 2, with at least 2^5 = 32 nodes. To run a program on Blue Gene/L, a partition of the computer first had to be reserved. The program was then loaded and run on all the nodes within the partition, and no other program could access nodes within the partition while it was in use. Upon completion, the partition nodes were released for future programs to use.
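The partition-size rule can be written as a small check. The sketch encodes only the two constraints stated above (a power of 2, at least 32 nodes). Real partitions may have been subject to further constraints not covered here, and the function names are illustrative.

```python
MIN_PARTITION_NODES = 2 ** 5  # 32 nodes, the minimum stated above


def is_valid_partition_size(nodes: int) -> bool:
    """True if `nodes` is a power of 2 and at least the minimum size."""
    is_power_of_two = nodes > 0 and (nodes & (nodes - 1)) == 0
    return is_power_of_two and nodes >= MIN_PARTITION_NODES


def smallest_partition_for(required_nodes: int) -> int:
    """Smallest valid partition size that holds `required_nodes` nodes."""
    size = MIN_PARTITION_NODES
    while size < required_nodes:
        size *= 2
    return size


for request in (16, 32, 48, 1000):
    print(request, is_valid_partition_size(request), smallest_partition_for(request))
```

A job needing 1,000 nodes would therefore occupy a 1,024-node partition, leaving 24 nodes idle for its duration.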
Blue Gene/L compute nodes used a minimal
operating system supporting a single user program. Only a subset of
POSIX calls was supported, and only one process could run at a time on a node in co-processor mode, or one process per CPU in virtual-node mode. Programmers needed to implement
green threads in order to simulate local concurrency. Application development was usually performed in
C,
C++, or
Fortran using
MPI for communication. However, some scripting languages such as
Ruby[19] and
Python[20] have been ported to the compute nodes.
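Green threads are scheduled cooperatively inside a single process, without help from the operating system. The sketch below shows the idea with Python generators and a round-robin scheduler. It is a generic illustration, not code from any Blue Gene application, and production codes would more likely have done this in C, C++ or Fortran. The task names are made up.

```python
from collections import deque
from typing import Deque, Generator, List

Task = Generator[None, None, None]


def worker(name: str, steps: int, log: List[str]) -> Task:
    """A cooperative task: does one unit of work, then yields control."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # hand control back to the scheduler


def run_round_robin(tasks: List[Task]) -> None:
    """Run tasks in turn within one process until all have finished."""
    ready: Deque[Task] = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)
        except StopIteration:
            continue  # task finished, so it is not re-queued
        ready.append(task)


log: List[str] = []
run_round_robin([worker("compute", 2, log), worker("comm", 3, log)])
print(log)
```

Because each task yields voluntarily, the two logical activities interleave even though only one process exists on the node.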
IBM published BlueMatter, the application developed to exercise Blue Gene/L, as open source.[21] This serves to document how the torus and collective interfaces were used by applications, and may serve as a base for others to exercise the current generation of supercomputers.
Blue Gene/P
The design of Blue Gene/P is a technology evolution from Blue Gene/L. Each Blue Gene/P Compute chip contains four
PowerPC 450 processor cores, running at 850 MHz. The cores are
cache coherent and the chip can operate as a 4-way
symmetric multiprocessor (SMP). The memory subsystem on the chip consists of small private L2 caches, a central shared 8 MB L3 cache, and dual
DDR2 memory controllers. The chip also integrates the logic for node-to-node communication, using the same network topologies as Blue Gene/L, but at more than twice the bandwidth. A compute card contains a Blue Gene/P chip with 2 or 4 GB DRAM, comprising a "compute node". A single compute node has a peak performance of 13.6 GFLOPS. Thirty-two compute cards are plugged into an air-cooled node board. A
rack contains 32 node boards (thus 1024 nodes, 4096 processor cores).[23]
By using many small, low-power, densely packaged chips, Blue Gene/P exceeded the
power efficiency of other supercomputers of its generation, and at 371
MFLOPS/W Blue Gene/P installations ranked at or near the top of the
Green500 lists in 2007–2008.[2]
Installations
The following is an incomplete list of Blue Gene/P installations. As of November 2009, the
TOP500 list contained 15 Blue Gene/P installations of 2 racks (2048 nodes, 8192 processor cores, 23.86
TFLOPS Linpack) and larger.[1]
On November 12, 2007, the first Blue Gene/P installation,
JUGENE, with 16 racks (16,384 nodes, 65,536 processors) was running at
Forschungszentrum Jülich in
Germany with a performance of 167 TFLOPS.[24] When inaugurated it was the fastest supercomputer in Europe and the sixth fastest in the world. In 2009, JUGENE was upgraded to 72 racks (73,728 nodes, 294,912 processor cores) with 144 terabytes of memory and 6 petabytes of storage, and achieved a peak performance of 1 PetaFLOPS. This configuration incorporated new air-to-water heat exchangers between the racks, reducing the cooling cost substantially.[25] JUGENE was shut down in July 2012 and replaced by the Blue Gene/Q system
JUQUEEN.
The 40-rack (40960 nodes, 163840 processor cores) "Intrepid" system at
Argonne National Laboratory was ranked #3 on the June 2008 TOP500 list.[26] The Intrepid system is one of the major resources of the INCITE program, in which processor hours are awarded to "grand challenge" science and engineering projects in a peer-reviewed competition.
A 2.5 rack Blue Gene/P system is the central processor for the Low Frequency Array for Radio astronomy (
LOFAR) project in the Netherlands and surrounding European countries. This application uses the streaming data capabilities of the machine.
In 2011, a 2-rack Blue Gene/P was installed at
University of Canterbury in Christchurch, New Zealand.
In 2012, a 2-rack Blue Gene/P was installed at
Rutgers University in Piscataway, New Jersey. It was dubbed "Excalibur" as an homage to the Rutgers mascot, the Scarlet Knight.[30]
The first Blue Gene/P in the ASEAN region was installed in 2010 at the
Universiti Brunei Darussalam’s research centre, the
UBD-IBM Centre. The installation has prompted research collaboration between the university and IBM research on climate modeling that will investigate the
impact of climate change on flood forecasting, crop yields, renewable energy and the health of rainforests in the region among others.[32]
In 2013, a 1-rack Blue Gene/P was donated to the Department of Science and Technology for weather forecasts, disaster management, precision agriculture, and health. It is housed in the National Computer Center, Diliman, Quezon City, under the auspices of the Philippine Genome Center (PGC) Core Facility for Bioinformatics (CFB) at UP Diliman.[33]
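The node, core and peak figures quoted for these installations follow from the rack count. The sketch below recomputes them, assuming four floating-point operations per core per cycle (a dual-pipeline FPU with fused multiply-add). That assumption reproduces the 13.6 GFLOPS per-node peak but is not stated in the text. The helper name is illustrative.

```python
CORES_PER_NODE = 4
CLOCK_GHZ = 0.85
FLOPS_PER_CYCLE_PER_CORE = 4   # assumption, see lead-in
NODES_PER_RACK = 32 * 32       # 32 compute cards x 32 node boards

node_peak_gflops = CORES_PER_NODE * CLOCK_GHZ * FLOPS_PER_CYCLE_PER_CORE


def describe(racks: float) -> dict:
    """Node count, core count and theoretical peak for a given rack count."""
    nodes = int(racks * NODES_PER_RACK)
    return {
        "nodes": nodes,
        "cores": nodes * CORES_PER_NODE,
        "peak_tflops": nodes * node_peak_gflops / 1000.0,
    }


for name, racks in [("2 racks", 2), ("JUGENE 2007", 16),
                    ("Intrepid", 40), ("JUGENE 2009", 72)]:
    info = describe(racks)
    print(f"{name:12s} {info['nodes']:>6d} nodes {info['cores']:>7d} cores "
          f"{info['peak_tflops']:8.1f} TFLOPS peak")

# Ratio of the quoted 2-rack Linpack result to the computed peak.
two_rack_efficiency = 23.86 / describe(2)["peak_tflops"]
print(f"2-rack Linpack efficiency: {two_rack_efficiency:.1%}")
```

The 72-rack configuration comes out at just over 1,000 TFLOPS, consistent with the 1 PetaFLOPS peak quoted for the upgraded JUGENE.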
Applications
Veselin Topalov, the challenger to the
World Chess Champion title in 2010, confirmed in an interview that he had used a Blue Gene/P supercomputer during his preparation for the match.[34]
The Blue Gene/P computer has been used to simulate approximately one percent of a human cerebral cortex, containing 1.6 billion
neurons with approximately 9 trillion connections.[35]
The
IBM Kittyhawk project team has ported Linux to the compute nodes and demonstrated generic Web 2.0 workloads running at scale on a Blue Gene/P. Their paper, published in the ACM Operating Systems Review, describes a kernel driver that tunnels Ethernet over the tree network, which results in all-to-all
TCP/IP connectivity.[36][37] Running standard Linux software like
MySQL, their performance results on SpecJBB rank among the highest on record.[citation needed]
In 2011, a Rutgers University / IBM / University of Texas team linked the
KAUST Shaheen installation together with a Blue Gene/P installation at the
IBM Watson Research Center into a "federated high performance computing cloud", winning the IEEE SCALE 2011 challenge with an oil reservoir optimization application.[38]
Blue Gene/Q
The third design in the Blue Gene series, Blue Gene/Q, significantly expanded and enhanced the Blue Gene/L and /P architectures.
Design
The Blue Gene/Q "compute chip" is based on the
64-bit IBM A2 processor core. The A2 processor core is 4-way
simultaneously multithreaded and was augmented with a
SIMD quad-vector
double-precision floating-point unit (IBM QPX). Each Blue Gene/Q compute chip contains 18 such A2 processor cores, running at 1.6 GHz. Sixteen cores are used for application computing and a 17th core is used for handling operating system assist functions such as
interrupts,
asynchronous I/O,
MPI pacing, and
RAS. The 18th core is a
redundant manufacturing spare, used to increase yield. The spared-out core is disabled prior to system operation. The chip's processor cores are linked by a crossbar switch to a 32 MB
eDRAM L2 cache, operating at half core speed. The L2 cache is multi-versioned—supporting
transactional memory and
speculative execution—and has hardware support for
atomic operations.[39] L2 cache misses are handled by two built-in
DDR3 memory controllers running at 1.33 GHz. The chip also integrates logic for chip-to-chip communications in a 5D
torus configuration, with 2 GB/s chip-to-chip links. The Blue Gene/Q chip is manufactured on IBM's copper SOI process at 45 nm. It delivers a peak performance of 204.8 GFLOPS while drawing approximately 55 watts. The chip measures 19×19 mm (359.5 mm²) and comprises 1.47 billion transistors. Completing the compute node, the chip is mounted on a compute card along with 16 GB
DDR3 DRAM (i.e., 1 GB for each user processor core).[40]
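The chip-level figures above are mutually consistent, as a short calculation shows. The sketch assumes the quad-vector unit completes one fused multiply-add per lane per cycle (8 operations per core per cycle). That assumption is not stated in the text but reproduces the 204.8 GFLOPS peak. The constant names are illustrative.

```python
USER_CORES = 16                 # cores available to applications
THREADS_PER_CORE = 4            # 4-way simultaneous multithreading
CLOCK_GHZ = 1.6
# Assumption: 4 SIMD lanes x 1 fused multiply-add (2 operations) per cycle.
FLOPS_PER_CYCLE_PER_CORE = 4 * 2
CHIP_POWER_W = 55.0             # approximate figure quoted above
NODE_MEMORY_GB = 16

chip_peak_gflops = USER_CORES * CLOCK_GHZ * FLOPS_PER_CYCLE_PER_CORE
chip_peak_gflops_per_watt = chip_peak_gflops / CHIP_POWER_W
memory_per_user_core_gb = NODE_MEMORY_GB / USER_CORES
user_hardware_threads = USER_CORES * THREADS_PER_CORE

print(f"peak: {chip_peak_gflops:.1f} GFLOPS")
print(f"peak per watt (chip only): {chip_peak_gflops_per_watt:.2f} GFLOPS/W")
print(f"memory per user core: {memory_per_user_core_gb:.0f} GB")
print(f"user hardware threads: {user_hardware_threads}")
```

The per-watt value here is a chip-only peak figure. It is not comparable with system-level Linpack efficiency numbers, which also account for memory, network, cooling and power delivery.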
A Q32[41] "compute drawer" contains 32 compute nodes, each water cooled.[42]
A "midplane" (crate) contains 16 Q32 compute drawers for a total of 512 compute nodes, electrically interconnected in a 5D torus configuration (4x4x4x4x2). Beyond the midplane level, all connections are optical. Racks have two midplanes, thus 32 compute drawers, for a total of 1024 compute nodes, 16,384 user cores, and 16 TB RAM.[42]
Separate I/O drawers, placed at the top of a rack or in a separate rack, are air cooled and contain 8 compute cards and 8 PCIe expansion slots for
InfiniBand or
10 Gigabit Ethernet networking.[42]
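The 4x4x4x4x2 midplane torus described above can be explored with a small model. The sketch enumerates the node coordinates and computes the minimal hop count between two nodes. It assumes shortest-path routing with one hop per link, a simplification that says nothing about how the real network routes packets. The function name is illustrative.

```python
from itertools import product
from typing import Sequence

MIDPLANE_DIMS = (4, 4, 4, 4, 2)


def torus_hops(a: Sequence[int], b: Sequence[int],
               dims: Sequence[int] = MIDPLANE_DIMS) -> int:
    """Minimal number of hops between nodes a and b on a torus."""
    total = 0
    for x, y, size in zip(a, b, dims):
        delta = abs(x - y)
        # Going "the other way round" may be shorter because of wrap-around.
        total += min(delta, size - delta)
    return total


nodes = list(product(*(range(d) for d in MIDPLANE_DIMS)))
origin = nodes[0]
# A torus looks the same from every node, so the farthest distance from one
# node equals the network diameter.
diameter = max(torus_hops(origin, n) for n in nodes)

print(f"nodes per midplane: {len(nodes)}")
print(f"maximum hop count within a midplane: {diameter}")
print(f"nodes per rack: {2 * len(nodes)}, user cores per rack: {2 * len(nodes) * 16}")
```

In this model no two nodes in a midplane are more than 9 hops apart, which illustrates why a higher-dimensional torus keeps distances short as node counts grow.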
Performance
At the time of the Blue Gene/Q system announcement in November 2011,[43] an initial 4-rack Blue Gene/Q system (4096 nodes, 65536 user processor cores) achieved #17 in the
TOP500 list[1] with 677.1 TeraFLOPS Linpack, outperforming the original 2007 104-rack Blue Gene/L installation described above. The same 4-rack system achieved the top position in the
Graph500 list[3] with over 250 GTEPS (giga
traversed edges per second). Blue Gene/Q systems also topped the
Green500 list of most energy efficient supercomputers with up to 2.1
GFLOPS/W.[2]
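The comparison with the 104-rack Blue Gene/L installation can be put in per-rack terms. The sketch uses only figures quoted in this article. The peak is computed from the per-chip value, and the ratios are derived here rather than taken from a source.

```python
BGQ_RACKS = 4
BGQ_NODES = 4096
BGQ_NODE_PEAK_GFLOPS = 204.8
BGQ_LINPACK_TFLOPS = 677.1

BGL_RACKS = 104
BGL_LINPACK_TFLOPS = 478.0

bgq_peak_tflops = BGQ_NODES * BGQ_NODE_PEAK_GFLOPS / 1000.0
bgq_linpack_efficiency = BGQ_LINPACK_TFLOPS / bgq_peak_tflops

bgq_linpack_per_rack = BGQ_LINPACK_TFLOPS / BGQ_RACKS
bgl_linpack_per_rack = BGL_LINPACK_TFLOPS / BGL_RACKS
per_rack_ratio = bgq_linpack_per_rack / bgl_linpack_per_rack

print(f"Blue Gene/Q 4-rack peak: {bgq_peak_tflops:.1f} TFLOPS")
print(f"Linpack efficiency: {bgq_linpack_efficiency:.1%}")
print(f"Linpack per rack, Q vs L: {per_rack_ratio:.1f}x")
```

On these numbers, one Blue Gene/Q rack delivered roughly 37 times the Linpack throughput of one Blue Gene/L rack.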
The following is an incomplete list of Blue Gene/Q installations. As of June 2012, the TOP500 list contained 20 Blue Gene/Q installations of half a rack (512 nodes, 8192 processor cores, 86.35 TFLOPS Linpack) and larger.[1] At a (size-independent) power efficiency of about 2.1 GFLOPS/W, all these systems also populated the top of the June 2012
Green500 list.[2]
A Blue Gene/Q system called
Sequoia was delivered to the
Lawrence Livermore National Laboratory (LLNL) beginning in 2011 and was fully deployed in June 2012. It is part of the
Advanced Simulation and Computing Program running nuclear simulations and advanced scientific research. It consists of 96 racks (comprising 98,304 compute nodes with 1.6 million processor cores and 1.6
PB of memory) covering an area of about 3,000 square feet (280 m²).[44] In June 2012, the system was ranked as the world's fastest supercomputer,[45][46] at 20.1
PFLOPS peak, 16.32
PFLOPS sustained (Linpack), drawing up to 7.9
megawatts of power.[1] In June 2013, its performance was listed at 17.17
PFLOPS sustained (Linpack).[1]
JUQUEEN at the
Forschungszentrum Jülich is a 28-rack Blue Gene/Q system, and from June 2013 to November 2015 it was the highest-ranked machine in Europe in the TOP500.[1]
Vulcan at
Lawrence Livermore National Laboratory (LLNL) is a 24-rack, 5 PFLOPS (peak), Blue Gene/Q system that was commissioned in 2012 and decommissioned in 2019.[49] Vulcan served Lab-industry projects through Livermore's High Performance Computing (HPC) Innovation Center[50] as well as academic collaborations in support of DOE/National Nuclear Security Administration (NNSA) missions.[51]
Fermi at the
CINECA Supercomputing facility, Bologna, Italy,[52] is a 10-rack, 2 PFLOPS (peak), Blue Gene/Q system.
A five-rack Blue Gene/Q system with additional compute hardware, called AMOS, was installed at Rensselaer Polytechnic Institute in 2013.[54] The system was rated at 1048.6 teraflops, making it the most powerful supercomputer at any private university and the third most powerful among all universities in 2014.[55]
An 838 TFLOPS (peak) Blue Gene/Q system called Avoca was installed at the
Victorian Life Sciences Computation Initiative in June 2012.[56] This system is part of a collaboration between IBM and VLSCI, with the aims of improving diagnostics, finding new drug targets, refining treatments and furthering our understanding of diseases.[57] The system consists of 4 racks, with 350 TB of storage, 65,536 cores, and 64 TB RAM.[58]
A 209 TFLOPS peak (172 TFLOPS LINPACK) Blue Gene/Q system called Lemanicus was installed at the
EPFL in March 2013.[61] This system belongs to the Center for Advanced Modeling Science (CADMOS),[62] which is a collaboration between the three main research institutions on the shore of
Lake Geneva in the French-speaking part of Switzerland:
University of Lausanne,
University of Geneva and
EPFL. The system consists of a single rack (1,024 compute nodes) with 2.1
PB of IBM GPFS-GSS storage.
A half-rack Blue Gene/Q system, with about 100 TFLOPS (peak), called Cumulus was installed at the A*STAR Computational Resource Centre, Singapore, in early 2011.[63]
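Because every Blue Gene/Q rack holds 1,024 identical nodes, the peak figures quoted for the installations above should scale with rack count. The sketch below checks this, treating the quoted values as rounded and assuming that the 1048.6 teraflops given for AMOS is a peak figure. The dictionary and helper names are illustrative.

```python
RACK_PEAK_TFLOPS = 1024 * 204.8 / 1000.0   # nodes per rack x per-node peak

# name: (racks, peak in TFLOPS as quoted in the list above)
installations = {
    "Sequoia": (96, 20100.0),
    "Vulcan": (24, 5000.0),
    "Fermi": (10, 2000.0),
    "AMOS": (5, 1048.6),      # assumed to be a peak figure
    "Avoca": (4, 838.0),
    "Lemanicus": (1, 209.0),
    "Cumulus": (0.5, 100.0),
}


def expected_peak_tflops(racks: float) -> float:
    """Theoretical peak implied by the rack count alone."""
    return racks * RACK_PEAK_TFLOPS


def relative_gap(quoted: float, expected: float) -> float:
    """Relative difference between the quoted and the computed peak."""
    return abs(quoted - expected) / expected


for name, (racks, quoted) in installations.items():
    expected = expected_peak_tflops(racks)
    print(f"{name:10s} quoted {quoted:8.1f}  computed {expected:8.1f}  "
          f"gap {relative_gap(quoted, expected):.1%}")

# Sequoia's system-level efficiency from its quoted Linpack result and power.
sequoia_gflops_per_watt = 16.32e6 / 7.9e6
print(f"Sequoia: {sequoia_gflops_per_watt:.2f} GFLOPS/W")
```

All quoted peaks fall within a few percent of the computed values. The larger gaps belong to figures that were rounded to one significant digit.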
Applications
Record-breaking science applications have been run on the BG/Q, the first to cross 10
petaflops of sustained performance. The cosmology simulation framework HACC achieved almost 14 petaflops with a 3.6 trillion particle benchmark run,[64] while the Cardioid code,[65][66] which models the electrophysiology of the human heart, achieved nearly 12 petaflops with a near real-time simulation, both on
Sequoia. A fully compressible flow solver has also achieved 14.4 PFLOP/s (originally 11 PFLOP/s) on Sequoia, 72% of the machine's nominal peak performance.[67]
^Appavoo, Jonathan; Uhlig, Volkmar; Waterland, Amos. "Project Kittyhawk: Building a Global-Scale Computer" (PDF). Yorktown Heights, NY: IBM T.J. Watson Research Center. Archived from the original on 2008-10-31. Retrieved 2018-03-13.
^"IBM announces 20-petaflops supercomputer". Kurzweil. 18 November 2011. Retrieved 13 November 2012. IBM has announced the Blue Gene/Q supercomputer, with peak performance of 20 petaflops
^S. Habib; V. Morozov; H. Finkel; A. Pope; K. Heitmann; K. Kumaran; T. Peterka; J. Insley; D. Daniel; P. Fasel; N. Frontiere & Z. Lukic (2012). "The Universe at Extreme Scale: Multi-Petaflop Sky Simulation on the BG/Q". arXiv:1211.4864 [cs.DC].