|
As node counts for terascale systems grow to tens of thousands, with
petascale systems likely to contain hundreds of thousands of nodes, we
must rethink traditional assumptions about software scaling and manageability
and hardware reliability. These challenges are exacerbated by the appearance
of multicore chips: two-way and four-way now, with hundred-core chips
projected. In addition, a tsunami of new experimental and computational
data poses equally vexing problems in analysis, transport and visualization.
Collectively, these scaling trends create power, cooling, reliability
and performance challenges that will require new approaches if we are
to realize the potential of petascale systems. Our thesis is that the “two
worlds” of software – distributed systems and parallel systems – must
meet, embodying ideas from each, if we are to build resilient systems.
This talk will describe recent experiments on power and environmental
monitoring, statistical sampling and reliability that suggest possible
solutions. |