High Performance Linpack

This page summarizes the information I have on HPL. Note: the HPL tuning guide came from a source I had forgotten by the time I archived the file. Thanks to whoever wrote it in the first place.

Introduction

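These are notes gathered for HPL tuning. N is the order (side length) of the matrix, and P * Q equals the number of MPI processes. This is not necessarily the number of nodes: the 32-node sample below uses an 8 x 8 grid, i.e. 64 processes. Q should be set greater than or equal to P. The two environment variables in the MPICH Tuning for HPL section adjust the socket buffer size and the global memory size.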
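
As a rough illustration of these sizing parameters, the sketch below lists the P x Q grids for a given process count (with P no larger than Q) and estimates the memory an N x N matrix needs. The function names are hypothetical, and the 8 bytes per element assumes the usual double-precision build of HPL.

```python
def process_grids(nprocs):
    """Return all (P, Q) pairs with P * Q == nprocs and P <= Q."""
    grids = []
    p = 1
    while p * p <= nprocs:
        if nprocs % p == 0:
            grids.append((p, nprocs // p))
        p += 1
    return grids


def matrix_bytes(n):
    """Memory for an n x n matrix of 8-byte doubles (assumed element size)."""
    return 8 * n * n


if __name__ == "__main__":
    # 64 processes, as in the 8 x 8 grid of the sample input.
    print(process_grids(64))
    # N = 44000 from the sample input, split evenly over the 64 processes.
    total = matrix_bytes(44000)
    print(f"total: {total / 2**30:.2f} GiB, "
          f"per process: {total / 64 / 2**20:.1f} MiB")
```

For 64 processes this lists (1, 64), (2, 32), (4, 16) and (8, 8); the sample input uses the last one.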

Sample Input file

This is a sample HPL input for a 32-node cluster. The HPL.dat file follows.

Note: For a 3-node cluster, start with N=5000.

-- BEGIN HPL.dat --
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
44000        Ns
1            # of NBs
80           NBs
1            # of process grids (P x Q)
8            Ps
8            Qs
16.0         threshold
1            # of panel fact
1            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
2            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
3            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
-- END HPL.dat --
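
HPL reads only the leading value(s) on each line and treats the rest as a comment, so the file is easy to inspect from a script. Below is a minimal sketch that assumes the layout shown above: two header lines, then one setting per line, with each list preceded by its count. `read_hpl_dat` and the dictionary keys are hypothetical names, not part of HPL, and newer HPL versions add lines (for example a PMAP line) that this sketch does not handle.

```python
SAMPLE = """HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
44000        Ns
1            # of NBs
80           NBs
1            # of process grids (P x Q)
8            Ps
8            Qs
16.0         threshold
"""


def read_hpl_dat(text):
    """Parse the sizing fields at the top of an HPL.dat file into a dict."""
    body = text.splitlines()[2:]  # skip the two free-form header lines
    pos = 0

    def next_tokens():
        # Return the whitespace-separated tokens of the next line.
        nonlocal pos
        tokens = body[pos].split()
        pos += 1
        return tokens

    def read_list():
        # A count line followed by a line holding that many integers.
        count = int(next_tokens()[0])
        return [int(tok) for tok in next_tokens()[:count]]

    cfg = {}
    cfg["output_file"] = next_tokens()[0]
    cfg["device_out"] = int(next_tokens()[0])
    cfg["Ns"] = read_list()
    cfg["NBs"] = read_list()
    # Ps and Qs share a single count line.
    ngrids = int(next_tokens()[0])
    cfg["Ps"] = [int(tok) for tok in next_tokens()[:ngrids]]
    cfg["Qs"] = [int(tok) for tok in next_tokens()[:ngrids]]
    cfg["threshold"] = float(next_tokens()[0])
    return cfg


if __name__ == "__main__":
    cfg = read_hpl_dat(SAMPLE)
    for p, q in zip(cfg["Ps"], cfg["Qs"]):
        print(f"N={cfg['Ns']} NB={cfg['NBs']} grid={p}x{q} ({p * q} processes)")
```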

Sample HPL Output

Note: this output comes from a different run than the sample input above (N = 25000 and PFACT = Right here, versus N = 44000 and PFACT = Crout in the HPL.dat).

===========================================================================
HPLinpack 1.0  --  High-Performance Linpack benchmark  --  September 27, 2000
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs.,  UTK
===========================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   25000
NB     :      80
P      :       8
Q      :       8
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Right
BCAST  :  2ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

----------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
  1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
  2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
  3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be          1.110223e-16
- Computational tests pass if scaled residuals are less than          16.0

============================================================================
T/V                N    NB    P    Q              Time            Gflops
----------------------------------------------------------------------------
W13R2R4        25000    80    8    8            539.04          1.933e+01
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0400536 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0098898 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0017031 ...... PASSED
============================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
----------------------------------------------------------------------------

End of Tests.
============================================================================
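
The result line can be cross-checked by hand. To my knowledge HPL credits a run with 2/3 N^3 + 3/2 N^2 floating-point operations and divides by the wall time, and the sketch below applies that to the run above. It also splits the T/V code into its fields. That field order is an assumption, inferred from the matching parameter list in this output rather than taken from the HPL documentation. Both function names are hypothetical.

```python
def hpl_gflops(n, seconds):
    """Gflop/s for an order-n solve, using 2/3 n^3 + 3/2 n^2 operations."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9


def decode_variant(code):
    """Split a T/V code such as 'W13R2R4' into fields (assumed layout)."""
    return {
        "timer": code[0],       # W = wall time
        "depth": int(code[1]),  # lookahead depth
        "bcast": int(code[2]),  # broadcast algorithm number
        "rfact": code[3],       # recursive panel factorization (L/C/R)
        "ndiv": int(code[4]),   # panels in recursion
        "pfact": code[5],       # panel factorization (L/C/R)
        "nbmin": int(code[6]),  # recursive stopping criterion
    }


if __name__ == "__main__":
    # N and Time taken from the result line above.
    print(f"{hpl_gflops(25000, 539.04):.3e}")
    print(decode_variant("W13R2R4"))
```

With N = 25000 and Time = 539.04 this gives about 19.33 Gflops, which agrees with the 1.933e+01 reported by HPL.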

MPICH Tuning for HPL

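We need to configure MPICH so that it can use more memory than its defaults allow. The two variables below set the p4 socket buffer size and the p4 global memory size. Export them before launching HPL: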

export P4_SOCKBUFSIZE=0x40000
export P4_GLOBMEMSIZE=16777296
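
To see what these values amount to, and to pass them to a launcher from a script, something like the following could be used. This is only a sketch: the helper name is hypothetical, it assumes both values are byte counts, and the commented-out `mpirun` line is a placeholder for however the job is actually started.

```python
import os

# Values copied from the exports above.
P4_SETTINGS = {
    "P4_SOCKBUFSIZE": "0x40000",
    "P4_GLOBMEMSIZE": "16777296",
}


def tuned_environment(base=None):
    """Return a copy of the environment with the P4 settings added."""
    env = dict(os.environ if base is None else base)
    env.update(P4_SETTINGS)
    return env


if __name__ == "__main__":
    for name, value in P4_SETTINGS.items():
        # int(value, 0) accepts both the hex and the decimal spelling.
        size = int(value, 0)
        print(f"{name} = {size} bytes ({size / 1024:.1f} KiB)")
    # Placeholder launch, adjust to the local MPI setup:
    # subprocess.run(["mpirun", "-np", "64", "./xhpl"], env=tuned_environment())
```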

Papers

HPL Paper
