We are currently using more than 72 nodes of the MOSIX system at the Hebrew University. The system has a theoretical speed of 5 Gflops and an aggregate memory of 20 GBytes. The nodes are interconnected by a high-performance, scalable Myrinet switch that provides low-latency, high-bandwidth communication.
The DUPSM code involves two computational phases: the building of the Hamiltonian matrix and the Lanczos diagonalization procedure. The use of the permutation symmetry group introduces additional (unconserved) labels for the basis states. This splits the Hamiltonian matrix into independent submatrices. These submatrices are distributed over different processors with dynamic load balancing, effectively applying a domain decomposition in Hilbert space. The building of the Hamiltonian matrix takes the bulk of the CPU time in large calculations, and this part of the code scales almost perfectly with the number of processors. The Lanczos procedure used in the second phase has been parallelized using the same data decomposition and is quite efficient, although it does not scale perfectly.
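The block-distribution idea can be illustrated with a minimal sketch in Python. The names and the quadratic cost model below are illustrative assumptions, not the actual DUPSM scheme: each submatrix is simply handed to the least-loaded processor.

```python
# Hypothetical sketch of the block-distribution idea: Hamiltonian submatrices
# (labelled by the unconserved symmetry quantum numbers) are handed to the
# least-loaded processor so that the build phase stays balanced.
# The function names and the dim**2 cost model are illustrative only.
import heapq

def distribute_blocks(block_dims, n_procs):
    """Greedy dynamic load balancing: assign each block to the processor
    with the smallest accumulated estimated cost."""
    # assume the cost of building a block grows like dim**2;
    # the real cost model in DUPSM would differ
    blocks = sorted(enumerate(block_dims), key=lambda b: -b[1])
    heap = [(0.0, p, []) for p in range(n_procs)]   # (load, proc id, block list)
    heapq.heapify(heap)
    for idx, dim in blocks:
        load, proc, assigned = heapq.heappop(heap)
        assigned.append(idx)
        heapq.heappush(heap, (load + dim**2, proc, assigned))
    return {proc: assigned for _, proc, assigned in heap}

if __name__ == "__main__":
    # ten blocks of varying dimension spread over 4 processors
    print(distribute_blocks([500, 1200, 80, 950, 300, 40, 700, 150, 620, 210], 4))
```

Sorting the blocks by size before assignment is a common greedy heuristic; it keeps the largest blocks from piling up on the same processor.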
Akiva Novoselsky and Michel Vallières have worked extensively with the MOSIX group (Amnon Barak and collaborators) to analyze the memory and CPU needs of the DUPSM code and optimize the use of the cluster. This optimization process also enabled the MOSIX group to fine-tune their cluster.
In the first version of the parallel code, each processor calculated several Hamiltonian submatrices (blocks) and stored them in RAM. The non-zero Hamiltonian matrix elements were thus distributed among the RAM of the various nodes. There are two advantages to storing the Hamiltonian matrix elements in RAM. First, the Hamiltonian matrix needs to be calculated only once. Second, disk I/O during the diagonalization procedure is greatly reduced. However, the maximal size of the sparse matrices that we can treat with this option is limited by the aggregate RAM of the system. On the MOSIX system, for example (with 256 MB of RAM per node and 64 nodes), this limits the matrix dimension to about 70,000.
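As a rough illustration of this in-RAM option (the function names and random blocks are hypothetical, not the DUPSM data structures), each node can keep its assigned blocks in compressed sparse row form and apply them to the corresponding slices of a vector:

```python
# Illustrative sketch (not the DUPSM storage format): each node keeps its
# assigned Hamiltonian blocks in sparse CSR form in RAM and applies them to
# the relevant slices of the Lanczos vector.
import numpy as np
from scipy.sparse import random as sparse_random

def build_local_blocks(block_specs, density=1e-3):
    """Pretend "build phase": create one random sparse symmetric block per spec.
    block_specs is a list of (offset, dimension) pairs assigned to this node."""
    blocks = []
    for k, (offset, dim) in enumerate(block_specs):
        a = sparse_random(dim, dim, density=density, format="csr", random_state=k)
        blocks.append((offset, (a + a.T) * 0.5))   # keep each block symmetric
    return blocks

def local_matvec(blocks, v):
    """This node's contribution to H|v>: each independent block acts only on
    its own slice of the global vector."""
    w = np.zeros_like(v)
    for offset, h_block in blocks:
        dim = h_block.shape[0]
        w[offset:offset + dim] = h_block @ v[offset:offset + dim]
    return w
```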
Another option was recently added to the DUPSM code. The Hamiltonian matrix is not stored in RAM but is recomputed each time it is needed for the matrix-vector product used in the Lanczos procedure. This approach saves a large amount of memory, since only the Lanczos vectors are stored. With this method we were able to treat matrices of dimension in the hundreds of thousands. The memory limitation is thus removed, at the expense of recalculating the Hamiltonian matrix for each Lanczos iteration.
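A minimal matrix-free sketch of this option is given below, assuming a user-supplied generator that yields the non-zero matrix elements on demand; the toy tridiagonal "Hamiltonian" and all names are illustrative, not DUPSM routines.

```python
# Matrix-free sketch of the "recompute on the fly" option: no matrix is
# stored, and every matrix-vector product re-walks the non-zero elements.
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def make_operator(dim, generate_elements):
    """Wrap an element generator in a LinearOperator so that each matvec
    regenerates the Hamiltonian elements instead of reading them from RAM."""
    def matvec(v):
        w = np.zeros(dim)
        for i, j, h_ij in generate_elements():   # yields non-zero H[i, j], i <= j
            w[i] += h_ij * v[j]
            if i != j:
                w[j] += h_ij * v[i]              # use symmetry of H
        return w
    return LinearOperator((dim, dim), matvec=matvec, dtype=np.float64)

# toy example: a tridiagonal "Hamiltonian" generated element by element
def toy_elements(dim):
    def gen():
        for i in range(dim):
            yield i, i, 2.0
            if i + 1 < dim:
                yield i, i + 1, -1.0
    return gen

H = make_operator(1000, toy_elements(1000))
lowest = eigsh(H, k=5, which="SA", return_eigenvectors=False)   # Lanczos-based solver
print(np.sort(lowest))
```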
We also have an option to store the Hamiltonian matrix in /tmp on each node. This option allows us to treat huge matrices. It is efficient on Beowulf-type parallel computers, since the Hamiltonian matrix only needs to be calculated once; on the other hand, it requires a large amount of disk I/O. In this respect, Beowulf-type parallel computers are very effective, since each node can perform the I/O on its local disk. In this sense, these systems make excellent parallel I/O machines.
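A hedged sketch of this local-disk option follows; the directory, file layout, and names are illustrative only. It reuses the same (offset, sparse block) pairs as the in-RAM sketch above: the blocks are written once to /tmp after the build phase and streamed back from local disk for every matrix-vector product.

```python
# Illustrative sketch of the local-disk option: blocks are saved once to the
# node's local /tmp and re-read for every matrix-vector product.
# Directory, file names, and function names are assumptions, not DUPSM's.
import os
import numpy as np
from scipy import sparse

def write_blocks_to_tmp(blocks, directory="/tmp/hamiltonian_blocks"):
    """Save each (offset, sparse block) pair once, after the build phase."""
    os.makedirs(directory, exist_ok=True)
    offsets = []
    for k, (offset, h_block) in enumerate(blocks):
        sparse.save_npz(os.path.join(directory, f"block_{k}.npz"), h_block.tocsr())
        offsets.append(offset)
    np.save(os.path.join(directory, "offsets.npy"), np.array(offsets))

def matvec_from_tmp(v, directory="/tmp/hamiltonian_blocks"):
    """Stream the blocks back from local disk to form this node's part of H|v>."""
    offsets = np.load(os.path.join(directory, "offsets.npy"))
    w = np.zeros_like(v)
    for k, offset in enumerate(offsets):
        h_block = sparse.load_npz(os.path.join(directory, f"block_{k}.npz"))
        dim = h_block.shape[0]
        w[offset:offset + dim] = h_block @ v[offset:offset + dim]
    return w
```

Because each node reads only its own local disk, the disk traffic is spread across the cluster rather than funneled through a shared file server.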
We are currently looking for new algorithmic ways to reduce the time needed to compute the spectra. Improving our Lanczos procedure is one way to accomplish this goal. We are also pursuing ways of improving the sparsity of the Hamiltonian matrix by introducing new quantum numbers to further block the matrix.