A year later, RU supercomputer may be fully online. Or, maybe not.

The overheated Caliburn

Rutgers officials now promise that the $10 million “supercomputer” that was destined to “revolutionize” computing at the state university, and throughout New Jersey, will be available to all users next month, a year after a coolant leak forced it to shut down only weeks after it was inaugurated with considerable public fanfare.

“All system components will be reconfigured in the modular data center during the holiday break and the full system will be open to users in January,” said a university spokeswoman in an email.

University officials have been predicting a rapid rehabilitation for just about a year, especially since this site revealed that the supercomputer, named “Caliburn,” had to be shut down because of a coolant leak that threatened the machine with disastrous overheating. In February, for example, they predicted the system would be up in “10 days.”

The spokeswoman denied claims by some critics that university officials failed to monitor the system adequately.

“The modular datacenter (which housed the computer) did have a coolant leak, which happens, but no one was monitoring it for trouble,” contended one source who asked to remain anonymous.

“Hence the problem snowballed and kept getting worse until someone happened to drive by and hear an alarm.”

The university spokeswoman contended, “The computer system team responded to alarms they received from the monitoring system and the system shut down automatically, as expected, when the event occurred.”

The university has been reluctant to provide details about not just the accident that shut down the $10 million machine but also the circumstances under which a local company with no past experience in building supercomputers, “High Point Solutions,” became the “lead” contractor in building it.

The university also refused to provide–or, possibly, even to ask High Point Solutions for–information concerning the company, its experience, and its finances; Rutgers did, however, ask for, receive and release that information about other companies (including giants IBM and Dell) involved in the project. The information it did provide about High Point changed over the course of this site’s investigation into the flawed supercomputer: the university at first called High Point the “lead” contractor and then minimized its role, saying it was little more than a sales rep.

High Point Solutions Stadium

Whatever role High Point played, the university has flatly refused to say how much money High Point received, contending that the financial relationship between that company and another–Supermicro, the San Jose and Taiwan firm that actually provided the hardware–could be revealed only by the two private companies. Both have refused.

The sensitivity is not surprising. High Point is owned by major donors to Rutgers. The gifts included the $6.5 million purchase of naming rights to the university’s football stadium. The university has flatly refused to answer questions about the relationship between High Point’s owners and the role it played in building the computer, insisting only that the company was awarded a contract after a “competitive bidding process.”

The university won’t even say whether any other companies–including those with long experience in supercomputing–bid against High Point.

What’s also troubling about the shutdown of the flawed supercomputer is this: These machines, despite their multimillion-dollar price tags, have short lives. The previous “high performance” computer built at Rutgers, the so-called “Ex Calibur,” lasted only five years before Rutgers officials decided it had outlived its life expectancy.

A Time magazine discussion of supercomputing pointed out that a new supercomputer is a “useful resource” for only about five years, and its reign as a “world class” machine is less than three. So a big chunk of Caliburn’s useful life already has been wasted, along with the money that paid for it.

What follows is the latest email exchange between this site and Dory Devlin of the Rutgers University Office of Public Information. This site opened the conversation with a statement provided by an anonymous source concerning how the computer came to be shut down:

From: Bob Braun <bob@bobbraunsledger.com>
Sent: Wednesday, December 6, 2017 12:59 PM
To: Dory Devlin
Subject: FW: [Records Center] Open Public Records Act Request :: R003398-022317

Subject: RE: [Records Center] Open Public Records Act Request :: R003398-022317

Dory—I note that you haven’t responded to the question whether Rutgers violated its own bidding policies by allowing a third-party contractual arrangement between Supermicro and High Point Solutions. I hope you will answer.

Meanwhile, I received the following note:

“The modular datacenter did have a coolant leak, which happens, but no one was monitoring it for trouble. Hence the problem snowballed and kept getting worse until someone happened to drive by and hear an alarm. Who buys a $10 million supercomputer and doesn’t monitor it for trouble? That’s flat-out negligence. Supercomputers get very very hot, which is why cooling is so important, and reputable vendors (like Dell and HP) set the servers to automatically shut down if the temperature crosses a threshold. Supermicro didn’t do this, the room got blazing hot, and the machine roasted. High Point/Supermicro wanted to play in the big leagues and it turns out they can’t hit a curveball. This is primarily Supermicro’s fault, but it’s also on the technical staff to doublecheck that, particularly if you insist on buying and managing your own (unnecessary) modular datacenter yourself. The “reduced capacity” means that the Dell piece (~20%) shut down properly and didn’t get roasted so those are the only surviving servers. Not coincidentally, (the university) had a technical staff member who was competent and diligent and took care of the Dell part. He left and the…clownshow took over…”

The statement raises the following questions to which I would require answers:

1) Is it true that “no one was monitoring” the modular data center “for trouble”?

2) Is it true that the problem of the leak continued “until someone happened to drive by and hear an alarm”?

3) Is it true that Supermicro did not “set the servers to automatically shut down” when the temperature got to a point where the servers were damaged or, as the writer points out, “got roasted”?

4) Is it true that the “reduced capacity” means the Dell piece of the center did shut down properly “so those are the only surviving servers”? Is it true that represents only about 20 percent of the supercomputer’s operation?

5) Is it true the cost of the project was over budget? If so, by how much?

6) When will Caliburn return to full capacity?

(Devlin responded):

Rutgers followed university bidding policies on the Caliburn project. As coordinator of the sale and purchase of Super Micro Computer, Inc.’s equipment and services to Rutgers, High Point Solutions Inc. signed an IT Professional Service Provider Agreement. Super Micro’s RFP included High Point Solutions as the local distributor to handle the purchase process. All bidders for the project were associated with local distributors.

The computer system team responded to alarms they received from the monitoring system and the system shut down automatically, as expected, when the event occurred.

The project was completed on budget and did not exceed it.

Caliburn is currently fully operational across two locations. Previously, all of the ELF system (Dell) and 10 percent of the Caliburn system (Super Micro) have been available since the spring at a different location and have been actively used daily by researchers. Currently, all of Caliburn is being used by researchers and is undergoing testing and performance validation. All system components will be reconfigured in the modular data center during the holiday break and the full system will be open to users in January.

Note: This site removed some personal references from the anonymous reader’s statement.

READ THE COMPLETE SERIES OF ARTICLES ABOUT THE RUTGERS SUPERCOMPUTER:

Breaking: New Jersey’s largest supercomputer, one of the world’s biggest, is down. 2/17/17 

New Jersey’s $10 million computer crash: Questions keep piling on. 2/23/17

New Jersey’s $10 million computer crash: Rutgers won’t answer any more questions. 2/24/17

The RU Computer: What football, hype, confusion, politics and lots of money gave to NJ. 3/3/17

THE RU/HIGH POINT STONEWALL: Why won’t Rutgers release public records? 3/10/17

Why isn’t the Rutgers Stadium named for Paul Robeson? 3/12/17

RUTGERS FLAWED SUPERCOMPUTER: Was favoritism to donors a factor? 10/1/17