(Sierra supercomputer at Lawrence Livermore National Laboratory in California.)
As the US competes with China to build the fastest supercomputers, you might be wondering how these giant machines are being used.
A supercomputer can contain hundreds of thousands of processor cores and require an entire building to house and cool it, not to mention millions of dollars to create and maintain. But despite these challenges, more and more are set to go online as the US and China develop new “exascale” supercomputers, machines capable of at least a quintillion calculations per second, which promise roughly a five-fold performance boost over today’s leading systems.
So who needs all this computing power, and why? To find out, PCMag visited the Lawrence Livermore National Laboratory in California, which is home to several supercomputers, including the world’s second fastest, Sierra. There we learned how system engineers keep the machines running not only to serve scientific researchers, but also to test something you might not expect: nuclear weapons.
A Classified System
About 1,000 people maintain the lab’s supercomputers and create programs for them.
When you visit Sierra, you’ll notice the words “classified” and “secret restricted data” posted on the supercomputer, which is made up of 240 server-like racks. The warnings exist because Sierra is processing data involving the US’s nuclear stockpile, including how the weapons should detonate in the real world.
The US conducted its last live nuclear weapons test in 1992. Since then, the country has used supercomputers to help carry out the experiments virtually, and Sierra is part of that mission. The machine was completed last year primarily to aid the US government in monitoring and testing the effectiveness of the country’s aging nuclear arsenal, which needs to be routinely maintained.
“The only way a deterrent works is if you know that it can function, and that your adversary also knows and believes it functions,” said Adam Bertsch, a high performance computing systems engineer at the lab.
Examples of simulations performed at the lab’s supercomputing center. On the left is a fusion energy research experiment involving heating and compressing a fuel target with 192 lasers. On the right is a hydrodynamics-related simulation of a ‘triple-point shock interaction.’
Not surprisingly, simulating a nuclear explosion requires a lot of math. Foundational principles in science can predict how particles will interact with each other under different conditions. The US government also possesses decades of data collected from real nuclear tests. Scientists have combined this information to create equations inside computer models, which can calculate how a nuclear explosion will go off and change over time.
Essentially, you’re trying to map out a chain reaction. To make the models accurate, they’ve been designed to predict a nuclear detonation at the molecular level using real-world physics. The challenge is that calculating what all these particles will do requires an enormous amount of number-crunching.
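To get a rough feel for what “stepping a model forward in time” means, here is a toy, zero-dimensional sketch of a chain reaction in Python. Every number and the simple growth rule are illustrative assumptions; the lab’s actual weapons codes model full three-dimensional physics and are classified.

```python
# Toy, zero-dimensional model of a chain reaction stepped forward in time.
# "k" is how many new neutrons each neutron produces; above 1.0 the reaction
# grows. All values are made-up illustrations, not real weapons physics.

k = 1.2          # neutrons produced per neutron lost (supercritical if > 1)
gen_time = 1e-8  # assumed time between neutron generations, in seconds
dt = 1e-9        # simulation time step, in seconds
neutrons = 1.0   # relative neutron population

for step in range(1, 101):
    # Each step, the population grows in proportion to how far k is above 1.
    neutrons += neutrons * (k - 1.0) / gen_time * dt
    if step % 25 == 0:
        print(f"t = {step * dt:.1e} s, relative population = {neutrons:.2e}")
```

A real code replaces that single growth rule with far more detailed physics across an enormous number of interacting particles and cells, which is where the supercomputer comes in.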
Enter Sierra. The supercomputer has 190,000 CPU cores and more than 17,000 GPUs. All that computing power means it can take a huge task, like simulating nuclear fission, and break it down into smaller pieces. Each core processes a tiny chunk of the simulation and communicates its results to the rest of the machine. The process repeats over and over as the supercomputer models a nuclear explosion from one moment to the next.
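For a concrete, if highly simplified, picture of that divide-and-communicate pattern, here is a minimal sketch using MPI, the message-passing standard common in high-performance computing. It relies on Python’s mpi4py library and a made-up one-dimensional “domain”; the lab’s production codes are far more sophisticated and are typically written in lower-level languages.

```python
# Minimal sketch of splitting one big calculation across many cores with MPI.
# The "domain" here is just a range of cell indices; a real physics code would
# hold mesh data, exchange boundary values, and repeat this every time step.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()      # which core am I?
size = comm.Get_size()      # how many cores are working on this job?

TOTAL_CELLS = 1_000_000     # pretend the simulation has a million cells
cells_per_rank = TOTAL_CELLS // size  # remainder cells ignored for brevity
my_start = rank * cells_per_rank
my_end = my_start + cells_per_rank

# Each core works only on its own slice of the domain.
local_result = sum(range(my_start, my_end))  # placeholder for real physics

# Combine every core's partial answer so all of them see the global result.
global_result = comm.allreduce(local_result, op=MPI.SUM)

if rank == 0:
    print(f"{size} ranks computed a combined result of {global_result}")

# Launched with something like:  mpirun -n 4 python decompose.py
```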
“You can do a full simulation of a nuclear device in the computer,” Bertsch added. “You can find out that it works, exactly how well it works and what kind of effects would happen.”
A Research Machine
Cable clusters help Sierra exchange data. Other lines carry water to keep the system cool.
A supercomputer’s ability to calculate and model particle interactions is why it’s become such an important tool for researchers. In a sense, reactions are happening all around us, whether it’s the weather, how a star forms, or what happens when human cells come into contact with a drug.
A supercomputer can simulate all these interactions. Scientists can then take the data to learn useful insights, like whether it’ll rain tomorrow, if a new scientific theory is valid, or if an upcoming cancer treatment holds any promise.
The same technologies can also let industries explore countless new designs and figure out which ones are worth testing in the real world. It’s why the lab has experienced huge demand for its two dozen supercomputers.
“No matter how much computing power we’ve had, people would use it up and ask for more,” Bertsch said.
It also explains why the US government wants an exascale supercomputer. The extra computing power will allow scientists to develop more advanced simulations, like recreating even smaller particle interactions, which could pave the way for new research breakthroughs. The exascale systems will also be able to complete current research projects in less time. “What you previously had to spend months doing might only take hours,” Bertsch added.
A researcher connects to one of the lab’s supercomputers online from a Linux PC. A ‘job’ can be queued up simply by using a command-line application.
Sierra is part of a classified network that isn’t connected to the public internet and is available to about 1,000 approved researchers in affiliated scientific programs. Another 3,000 or so people conduct research on the lab’s unclassified supercomputers, which are accessible online provided you have a user account and the right login credentials. (Sorry, Bitcoin miners.)
“We have people buy into the computer at the acquisition time,” Bertsch said. “The amount of money you put in correlates to the percentage of the machine you bought.”
A scheduling system is used to ensure each group gets its “fair share” of the machine. “It tries to steer your usage toward the percentage you’ve been allocated,” Bertsch added. “If you used less than your fair share over time, your priority goes up and you’ll run sooner.”
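Bertsch doesn’t spell out the scheduler’s exact formula, but the fair-share idea can be illustrated with a small, hypothetical Python sketch: groups that have used less than the fraction of the machine they bought get bumped up in priority.

```python
# Simplified illustration of "fair share" scheduling. This is not the lab's
# actual scheduler or formula, just the general concept: use less than your
# allocation and your jobs move up; use more and they wait.

def fair_share_priority(allocated_fraction: float, used_fraction: float) -> float:
    """Return a priority score: above 1.0 means you are owed time."""
    if used_fraction == 0:
        return 2.0  # a group that has never run jumps toward the front
    return allocated_fraction / used_fraction

# A group that bought 10% of the machine but has recently used only 5%...
print(fair_share_priority(0.10, 0.05))  # 2.0 -> runs sooner
# ...beats a group that bought 20% but has been using 30%.
print(fair_share_priority(0.20, 0.30))  # ~0.67 -> waits longer
```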
Simulations are always running. One supercomputer can run thousands of jobs at any given time. A machine can also process what’s called a “hero run,” or a single job that’s so big the entire supercomputer is required to complete it in a reasonable time.
Keeping It Up And Running
The guts of another supercomputer, Sequoia. One rack is not too different from a server.
Sierra is a supercomputer, but the machine has largely been made with commodity parts. The processors, for example, are enterprise-grade chips from IBM and Nvidia, and the system itself runs Red Hat Enterprise Linux, a popular OS among server vendors.
“Back in the day, supercomputers were these monolithic big, esoteric blobs of hardware,” said Robin Goldstone, the lab’s high performance computing solution architect. “These days, even the world’s biggest systems are essentially just a bunch of servers connected together.”
To maximize its use, a system like Sierra needs to be capable of conducting many different kinds of research, so the lab set out to create an all-purpose machine. But even a supercomputer isn’t perfect. The lab estimates that roughly every 12 hours Sierra suffers an error, often involving a hardware malfunction. That may sound surprising, but think of it as owning 100,000 computers: failures and repairs are inevitable.
“The most common things that fail are probably memory DIMMs, power supplies, fans,” Goldstone said. Fortunately, Sierra is so huge that it has plenty of spare capacity. The supercomputer also routinely backs up jobs’ in-memory state in case an error disrupts a project.
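Those “memory backups” are what the HPC world usually calls checkpoint/restart: a job periodically saves its state so a failure only costs the work done since the last save. Here is a minimal, hypothetical sketch of the idea; real codes write checkpoints to a parallel file system rather than pickling a dictionary to local disk.

```python
# Toy checkpoint/restart loop: save state every 100 steps so a crash only
# loses the work since the most recent checkpoint. File name and step counts
# are arbitrary choices for illustration.
import os
import pickle

CHECKPOINT = "state.pkl"

def load_or_init():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)          # resume after a failure
    return {"step": 0, "value": 0.0}       # fresh start

state = load_or_init()
while state["step"] < 1000:
    state["value"] += 1.0                  # stand-in for one simulation step
    state["step"] += 1
    if state["step"] % 100 == 0:
        with open(CHECKPOINT, "wb") as f:
            pickle.dump(state, f)          # write the checkpoint

print("finished at step", state["step"])
```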
“To some degree, this isn’t exactly like a PC you have at home, but a flavor of that,” Goldstone added. “Take the gamers who are obsessed with getting the fastest memory, and the fastest GPU, and that’s the same thing we’re obsessed with. The challenge with us is we have so many running at the same time.”
Below the supercomputers is a piping system that sends up room-temperature water to keep the machines cool. Sierra is 80 percent water-cooled, 20 percent air-cooled.
Sierra itself sits in a 47,000-square-foot room filled with the noise of fans keeping the hardware cool. A level below the machine is the building’s water pumping system. Each minute, it can push thousands of gallons through pipes that feed into the supercomputer’s racks and then circulate the water back out.
On the power front, the lab is equipped to supply 45 megawatts, or enough electricity for a small city. About 11 of those megawatts are allotted to Sierra. However, a supercomputer’s power consumption can occasionally spark complaints from local energy companies: when an application crashes, a machine’s energy demand can suddenly drop by several megawatts.
The energy supplier “does not like that at all. Because they have to shed load. They are paying for power,” Goldstone said. “They’ve called us up on the phone and said, ‘Can you not do that anymore?'”
The Exascale Future
Last year, Sequoia ranked as the 10th fastest supercomputer in the world. But it will soon be replaced by a more powerful machine.
The Lawrence Livermore National Lab is also home to another supercomputer called Sequoia, which briefly reigned as the world’s top system back in 2012. But the lab plans to retire it later this year to make way for a bigger and better supercomputer, called El Capitan, which is among the exascale supercomputers the US government has been planning.
Expect it to go online in 2023. But it won’t be alone. El Capitan will join two other exascale systems, which the US is spending over $1 billion to construct. Both will be completed in 2021 at separate labs in Illinois and Tennessee.
“At some point, I keep thinking, ‘Isn’t it fast enough? How much faster do we really need these computers to be?'” Goldstone said. “But it’s more about being able to solve problems faster or study problems at higher resolution, so we can really see something at the molecular levels.”
But the supercomputing industry will eventually need to innovate. It’s simply unsustainable to continue building bigger machines that eat up more power and take more physical room. “We’re pushing the limits of what today’s technology can do,” she said. “There’s going to have to be advances in other areas beyond traditional silicon-based computing chips to take us to that next level.”
In the meantime, the lab has been working with vendors such as IBM and Nvidia to resolve immediate bottlenecks, including improving a supercomputer’s network architecture so data can move quickly between the different clusters, and improving component reliability. “Processor speed just doesn’t matter anymore,” she added. “As fast as the processors are, we’re constrained by memory bandwidth.”
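Goldstone’s point can be made with a back-of-the-envelope “roofline” style estimate: compare how long a calculation would take if only processor speed mattered against how long it would take if only memory traffic mattered. The hardware figures below are round, hypothetical numbers, not Sierra’s actual specifications.

```python
# Rough estimate of whether a calculation is limited by the processor or by
# memory bandwidth. All numbers are illustrative, not real hardware specs.

peak_flops = 7e12        # 7 teraflops of compute (hypothetical accelerator)
peak_bandwidth = 900e9   # 900 GB/s of memory bandwidth (hypothetical)

flops = 1e12             # floating-point operations the kernel performs
bytes_moved = 8e12       # bytes it must read from and write to memory

compute_time = flops / peak_flops           # time if only compute mattered
memory_time = bytes_moved / peak_bandwidth  # time if only memory mattered

print(f"compute-limited time: {compute_time * 1e3:.1f} ms")
print(f"memory-limited time:  {memory_time * 1e3:.1f} ms")
print("bound by:", "memory bandwidth" if memory_time > compute_time else "processor speed")
```

In this made-up case the data movement takes far longer than the arithmetic, which is why a faster processor alone wouldn’t help.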
The lab will announce more details about El Capitan in the future. As for the computer it’s replacing, Sequoia, the system is headed for oblivion.
For security purposes, the lab plans to grind up every piece of the machine and recycle its remains. Supercomputers can end up holding classified government data, so it’s vital that any trace of that information is completely purged, even if that means turning the machine into scrap. That may sound extreme, but errors can be made when trying to wipe the data in software, so the lab needs to be absolutely sure it’s permanently gone.