Slow PWLFCs
PWLFC is currently the main limit to the scalability of PW mode. In particular, the expand() mechanism sometimes takes a disproportionate amount of time compared to the rest of the calculation.
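For reference, the core of expand() is roughly the following (a schematic NumPy sketch, not the actual GPAW code; the names Y_LG and f_ajG are the PWLFC arrays, but the exact shapes, the emiGR_Ga phase array, and the loop structure here are my assumptions):

```python
import numpy as np

def expand(f_ajG, Y_LG, emiGR_Ga, l_aj):
    """Schematic expand(): build the (nG, nI) matrix of localized
    functions in the plane-wave basis.

    f_ajG:    per-atom radial parts on the G grid, f_ajG[a][j]
    Y_LG:     real spherical harmonics Y_L(G^), shape (Lmax, nG)
    emiGR_Ga: phase factors exp(-i G . R_a), shape (nG, natoms)
    l_aj:     angular momentum l of each (atom, radial) channel
    """
    nG = Y_LG.shape[1]
    nI = sum(2 * l + 1 for l_j in l_aj for l in l_j)
    f_GI = np.empty((nG, nI), dtype=complex)
    I = 0
    for a, l_j in enumerate(l_aj):
        for j, l in enumerate(l_j):
            for m in range(2 * l + 1):
                L = l**2 + m
                # One column per (a, l, m) channel: radial part times
                # angular part times structure factor.
                f_GI[:, I] = f_ajG[a][j] * Y_LG[L] * emiGR_Ga[:, a]
                I += 1
    return f_GI
```

Note that each column assignment f_GI[:, I] strides through a C-ordered array with stride nI, which already hints at the memory-locality problem discussed further down.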
I am willing to do some work to improve the situation, but it would be nice to first decide how to do it.
I ran a benchmark for a 32-atom Ir surface. Excerpts from the performance breakdown, including some extra timings not normally printed:
SCF-cycle:            620.872     2.901   0.4% |
 Davidson:            422.117     0.096   0.0% |
  pwlfc integrate:     43.897    29.964   4.3% |-|
   expand:             13.933    13.933   2.0% ||
  pwlfc add:           43.203    29.543   4.3% |-|
   expand:             13.660    13.660   2.0% ||
pt.integrate and pt.add together take 87 s, or 14% of the SCF time. 27 s of this is spent in expand().
 Mix:                  66.196     0.861   0.1% |
  ghat add:            65.335     0.000   0.0% |
   pwlfc add:          65.335     7.420   1.1% |
    expand:            57.915    57.915   8.4% |--
I have previously claimed that the "mixer" was expensive because that is the name of the timing, but it is actually ghat.add, whose expand() takes another 8.4% of the total time in this case.
This is the term that really limits the scalability of PW mode, and it is particularly annoying for sparse-ish systems (molecules, etc.) because ghat, unlike the projectors, is not affected by k-point sampling.
I notice that the PWLFC code is not really written for gamma-point/density quantities: it calculates a full complex array but then works on a float view of it.
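To illustrate the point: a complex array can be reinterpreted as a real array with interleaved real/imaginary parts, which is presumably what the float view does. The reinterpretation itself is free (no copy), but both parts were still computed even when only the real part is ever needed (a minimal NumPy demonstration; the actual PWLFC arrays are of course much larger):

```python
import numpy as np

# A tiny stand-in for a complex array as PWLFC would produce it.
z_G = np.array([1.0 + 2.0j, 3.0 + 4.0j])

# Viewing it as float64 interleaves (re, im) pairs without copying:
r = z_G.view(float)
print(r)  # [1. 2. 3. 4.]
```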
 Calculate atomic Hamiltonians:
                        2.940     0.003   0.0% |
  ghat integrate:       2.937     0.000   0.0% |
   pwlfc integrate:     2.937     0.496   0.1% |
    expand:             2.441     2.441   0.4% |
For some reason ghat.integrate is much, much faster; I guess because something is done in parallel.
There are a few other operations where PWLFCs are used, but they are less important.
Some possible solutions:

1) Don't calculate the full complex array for density-like/float-type localized functions.

2) Use real-space LFCs for ghat in combination with the grid-redistribution functionality to take advantage of parallelism. (I already implemented this, and it is much faster, but at the moment it does its own full back-and-forth redistribution; it should instead reuse the back-and-forth redistribution which it already performs for the XC part.) Disadvantage: affects numbers.

3) Rewrite expand() in C.

3a) Mind the memory locality (whether in Python or C).

3b) Hardwire the G blocking at PWLFC object creation time so that it can store and work on Y_LG and f_ajG in contiguous blocks over LG or GL. This is likely to help a lot, but it commits us further to the current design of PWLFCs, and I don't know whether there are other things that should be changed first.

4) Find out why ghat.add is so much slower than ghat.integrate. Can ghat.add be made as fast as ghat.integrate using the same technique?

5) Parallelize the whole thing over PWs (it is probably wiser to do some of the above first, and this would be a major project, too).
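To make 3a)/3b) concrete, the idea would be to process fixed-size contiguous blocks of G vectors so that the slices of Y_LG and f_ajG being worked on stay in cache (a hedged sketch only; the block size, array layout, and the placeholder inner operation are my assumptions, not existing code):

```python
import numpy as np

def process_blocked(f_IG, blocksize=256):
    """Walk an (nI, nG) array in contiguous G blocks.

    The inner operation here (summing |f|^2 per function) is just a
    stand-in for the real work; the point is the access pattern:
    each block f_IG[:, G1:G2] is contiguous along G for every I and
    small enough to stay resident in cache.
    """
    nI, nG = f_IG.shape
    out_I = np.zeros(nI)
    for G1 in range(0, nG, blocksize):
        G2 = min(G1 + blocksize, nG)
        block = f_IG[:, G1:G2]
        out_I += (block * block.conj()).real.sum(axis=1)
    return out_I
```

Fixing the blocking at object-creation time, as 3b) suggests, would additionally let the arrays be *stored* in that blocked layout instead of sliced on the fly.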
It is probably out of the question to get rid of the expand() function by storing all the function values, because then the pt object becomes even bigger than the wavefunctions.
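A back-of-the-envelope estimate of why (all numbers here are illustrative guesses for a 32-atom transition-metal slab, not measured values):

```python
# Rough memory estimate per k-point; every number below is an
# assumption for illustration, not taken from the benchmark above.
nG = 20000            # plane-wave coefficients
natoms = 32
nproj_per_atom = 13   # order-of-magnitude guess for a PAW setup
nbands = 250          # guess: valence/2 plus extra bands

bytes_complex = 16    # complex128

stored_pt = nG * natoms * nproj_per_atom * bytes_complex
psi_nG = nG * nbands * bytes_complex

print(stored_pt / 1e6, "MB for fully stored projector values")
print(psi_nG / 1e6, "MB for the wavefunctions")
```

With these numbers the stored projectors (~133 MB) already exceed the wavefunctions (~80 MB), and both scale the same way with system size and k-points, so precomputation never wins on memory.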
Do other DFT codes perform these operations differently?