New Poll Regarding GPU Application of N-Body

Author	Message
Link Joined: 19 Jul 10 Posts: 832 Credit: 21,854,963 RAC: 8,149	Message 74958 - Posted: 29 Jan 2023, 14:17:12 UTC - in response to Message 74955. Last modified: 29 Jan 2023, 14:19:56 UTC I think you are confused. There is no N-body gpu app so no N-body gpu code. I'm afraid you are confused this time: there is an n-Body GPU application, but it was never released because it needs the same time on the GPU as the CPU application needs on the CPU, while Separation is 50-60 times faster on GPU than on CPU. See the first message of this thread. Releasing this application as is would slow down the overall throughput of the project.
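To put rough numbers on the throughput argument, here is a minimal sketch. The 1x figure for n-Body and the 55x figure for Separation (the midpoint of "50-60 times") come from the post above; the 24-hour window and the function name are assumptions made up for illustration, not anything from the project's code.

```python
# Illustrative arithmetic only. Speedups are taken from the post above;
# the 24-hour window and all names here are hypothetical.

def cpu_hour_equivalents(gpu_speedup: float, gpu_hours: float) -> float:
    """CPU-hours of work a GPU completes in gpu_hours at the given speedup."""
    return gpu_speedup * gpu_hours

NBODY_GPU_SPEEDUP = 1.0        # "needs same time on the GPU as the CPU application"
SEPARATION_GPU_SPEEDUP = 55.0  # midpoint of "50-60 times faster"

gpu_hours = 24.0  # assumed: one GPU crunching for one day

work_nbody = cpu_hour_equivalents(NBODY_GPU_SPEEDUP, gpu_hours)
work_separation = cpu_hour_equivalents(SEPARATION_GPU_SPEEDUP, gpu_hours)
lost = work_separation - work_nbody

print(f"n-Body on GPU:     {work_nbody:.0f} CPU-hour equivalents")
print(f"Separation on GPU: {work_separation:.0f} CPU-hour equivalents")
print(f"Work given up by running n-Body on the GPU: {lost:.0f} CPU-hour equivalents")
```

Under these assumptions, every GPU-day spent on the unreleased n-Body app would replace a day that could have done far more Separation work, which is the sense in which releasing it "as is" would lower overall throughput.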

Speedy51 Joined: 12 Jun 10 Posts: 57 Credit: 6,527,559 RAC: 397	Message 74961 - Posted: 29 Jan 2023, 20:23:28 UTC - in response to Message 74958. Releasing this application as is would slow down the overall throughput of the project. Exactly the reason why the N-body application does not have a GPU version.

mikey Joined: 8 May 09 Posts: 3339 Credit: 524,398,788 RAC: 79	Message 74962 - Posted: 30 Jan 2023, 1:00:51 UTC - in response to Message 74961. Releasing this application as is would slow down the overall throughput of the project. Exactly the reason why the N-body application does not have a GPU version. On top of that, it doesn't make a bit of sense to even have it around. What's the point if it's not as efficient as the current CPU app? They already have a GPU app here, so they don't NEED to release a slow, inefficient one.

Link Joined: 19 Jul 10 Posts: 832 Credit: 21,854,963 RAC: 8,149	Message 74971 - Posted: 31 Jan 2023, 10:59:36 UTC - in response to Message 74962. On top of that, it doesn't make a bit of sense to even have it around. What's the point if it's not as efficient as the current CPU app? They already have a GPU app here, so they don't NEED to release a slow, inefficient one. Of course, to speed up n-Body processing they could offer the option not to "Run CPU versions of applications for which GPU versions are available", and credits similar to Separation would surely help too. Right now, when crunching n-Body, one might get the impression that it isn't a very valuable type of work, considering the "pay" we get for it. ;-)

mikey Joined: 8 May 09 Posts: 3339 Credit: 524,398,788 RAC: 79	Message 74973 - Posted: 31 Jan 2023, 12:35:25 UTC - in response to Message 74971. On top of that, it doesn't make a bit of sense to even have it around. What's the point if it's not as efficient as the current CPU app? They already have a GPU app here, so they don't NEED to release a slow, inefficient one. Of course, to speed up n-Body processing they could offer the option not to "Run CPU versions of applications for which GPU versions are available", and credits similar to Separation would surely help too. Right now, when crunching n-Body, one might get the impression that it isn't a very valuable type of work, considering the "pay" we get for it. ;-) I agree the credits aren't up to par. Several admins have said they will look at it, but they always come back with 'we're happy where they are', which isn't helpful at all for those doing the actual crunching.

reiner Joined: 25 May 23 Posts: 13 Credit: 58,073 RAC: 0	Message 75687 - Posted: 18 Jun 2023, 21:09:57 UTC I'd imagine that many current GPU users have a lot of FP64-capable GPUs for MW@H, since the Separation tasks needed FP64. If I understand correctly, the n-body calculations also need FP64. Leaving out these GPUs would leave a lot of compute power unused.
Regarding efficiency: I am new to BOINC/MW@H, but in the time I have contributed so far, I compared my older GCN1 cards with older Threadripper CPUs on the Separation tasks, and the GPU was A LOT more efficient. Even an older R9 280X with a 1:4 FP64 ratio was around 30 times faster than my Threadripper while consuming only a fraction of the watts when tuned down a little. GPU efficiency goes up quite a bit if the power target is lowered a little. With around 40 watts I was able to complete one task in around one minute on the GPU.
So here is a thought from someone who is not a programmer: how about converting the n-body application to OpenCL? This way, CPUs and all GPUs could contribute, with one universal codebase that is easy to maintain. (I don't know if this is applicable to the MW@H variant of n-body, but I have seen a lot of OpenCL benchmarks using n-body on both GPU and CPU, so it seems possible.) This would also make the integrated GPUs of CPUs usable. Some Intel Iris Xe iGPUs actually pack a lot of punch in FP64 with very high efficiency (around 20 watts for 600 GFLOPS in FP64), and even some UHD variants are very efficient. While inferior in raw numbers to some dedicated GPUs, per watt they can easily keep up with 1080s, smaller RTX cards, GCN 5, etc. Imagine every currently unused small notebook contributing as much FP64 power as midrange gaming PCs (per watt).
I know that CUDA is often favoured because it is often easier to work with and sometimes offers better speeds, but as far as I understand it, OpenCL can also be very fast and powerful; it only needs more fiddling when coding. Going CUDA would leave out all AMD, Intel and mobile chips, and CPUs as well. With OpenCL, even the FP64-powerful oldies of GCN 1 and 2 could still be used, or the Intel Xeon Phis could contribute (they are actually not so bad for FP64).
I recently saw some fluid dynamics simulations from this guy, who also provides an OpenCL wrapper that makes some things in OpenCL a lot easier. It's a long shot and I have no idea if this makes sense (not being a programmer), but maybe he has some ideas on how to accelerate the n-body implementation of MW@H in OpenCL? https://github.com/ProjectPhysX/FluidX3D
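The efficiency claim above can be turned into energy per task with simple arithmetic. This is only a sketch: the 40 W, one-minute and 30x figures are the poster's Separation numbers, while the CPU power draw of 180 W is a purely hypothetical value chosen for illustration (the post does not state it), and it assumes the "30 times faster" comparison is per task.

```python
# Energy-per-task sketch. GPU figures are from the post above (Separation tasks);
# CPU_POWER_W is a made-up assumption, not a measured value.

def energy_joules(power_watts: float, seconds: float) -> float:
    """Energy used by a device drawing power_watts for the given time."""
    return power_watts * seconds

GPU_POWER_W = 40.0       # "around 40 watts"
GPU_TASK_SECONDS = 60.0  # "one task in around one minute"
GPU_SPEEDUP = 30.0       # "around 30 times faster than my Threadripper"
CPU_POWER_W = 180.0      # ASSUMPTION: hypothetical CPU package power

# If the GPU is 30x faster per task, the CPU needs 30x the time for one task.
cpu_task_seconds = GPU_TASK_SECONDS * GPU_SPEEDUP

gpu_energy = energy_joules(GPU_POWER_W, GPU_TASK_SECONDS)
cpu_energy = energy_joules(CPU_POWER_W, cpu_task_seconds)
energy_ratio = cpu_energy / gpu_energy

print(f"GPU: {gpu_energy:.0f} J per task")
print(f"CPU: {cpu_energy:.0f} J per task (at the assumed {CPU_POWER_W:.0f} W)")
print(f"CPU uses {energy_ratio:.0f}x the energy per task under these assumptions")
```

Swapping in a measured CPU power figure changes the ratio, but the GPU side (40 W for 60 s, i.e. 2400 J per task) follows directly from the numbers in the post. None of this says anything about n-Body, where the GPU app was reported to be no faster than the CPU app.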

Falconet Joined: 9 Mar 09 Posts: 17 Credit: 880,663 RAC: 777	Message 75688 - Posted: 18 Jun 2023, 21:49:55 UTC - in response to Message 75687. They did build an OpenCL app for Nvidia and AMD. However, as you can read in the OP, it wasn't faster than the CPU app so it did not make sense to deploy it. They are open to it but only if it's significantly faster than CPUs. From this message: For anyone who wants to play around with the GPU code, the current GPU code for Nbody can be found at the sheilsGPU branch (https://github.com/Milkyway-at-home/milkywayathome_client/tree/sheilsGPU). It is out of date compared to the CPU version of Nbody though, so you will need to compare against master to see what changes have been made since then. If you are able to produce a meaningful speedup for widely-used GPU architecture (and share the source code), we would be happy to consider implementing any upgrades you come up with! It is hard to make speedups for generic GPU architecture (which is what we would prefer to do as a BOINC project), but if most people are using a specific type of GPU that can get 50-100x speed up compared to the CPU code, we would definitely be interested.

reiner Joined: 25 May 23 Posts: 13 Credit: 58,073 RAC: 0	Message 75689 - Posted: 18 Jun 2023, 22:00:43 UTC I have no clue whether I am on the wrong path here, but I just remembered that CompuBench runs n-body as part of its benchmarks in OpenCL or CUDA, and in OpenCL both CPUs and GPUs are tested. Looking at the table of results, even older or smaller GPUs outperform CPUs in this task. So to me it seems that doing n-body on GPUs is worth it. Or am I missing something here? https://compubench.com/result.jsp?benchmark=compu20d&test=725&text-filter=&order=median&ff-desktop=true&os-Windows_cl=true&os-Windows_cu=true&pu-dGPU=true&pu-iGPU=true&pu-ACC=true&arch-x86=true&base=device

Falconet Joined: 9 Mar 09 Posts: 17 Credit: 880,663 RAC: 777	Message 75690 - Posted: 18 Jun 2023, 22:50:45 UTC - in response to Message 75689. Read the message I typed before yours.

reiner Joined: 25 May 23 Posts: 13 Credit: 58,073 RAC: 0	Message 75692 - Posted: 19 Jun 2023, 8:56:36 UTC - in response to Message 75690. Read the message I typed before yours. I did :)

Falconet Joined: 9 Mar 09 Posts: 17 Credit: 880,663 RAC: 777	Message 75701 - Posted: 19 Jun 2023, 12:19:11 UTC - in response to Message 75692. We can only hope someone will be kind enough to improve the OpenCL app into something much, much faster than it is right now.

alanb1951 Joined: 16 Mar 10 Posts: 218 Credit: 111,183,008 RAC: 982	Message 75728 - Posted: 19 Jun 2023, 23:51:02 UTC - in response to Message 75701. We can only hope someone will be kind enough to improve the OpenCL app into something much, much faster than it is right now. And someone would probably have to commit to amending the OpenCL code and/or the OpenCL-related code in the CPU-based part of the application whenever there was a relevant change to the science code in the "CPU-only" application code :-)
I get the impression that this application isn't one of those simple ones where the GPU-based calculations are more or less unchangeable and can be completely controlled via parameter settings. If all it is doing is (for instance) FFTs or simple optimization of a matrix, the only issue would be "Can it be made efficient enough to make it worth doing?" However, if it would either entail lots of shuffling data around on the GPU or frequent movement of data to and from the GPU between GPU-worthy sections of computation that might be a completely different matter!
And, of course, if part of making it efficient enough entails "messing" with support libraries or adding hacks to facilitate using the GPU for one task whilst another one is doing CPU-intensive stuff, that would have to be done very carefully, especially if end users might be using their GPUs for more than one BOINC application -- I recall an issue over at WCG a while back which was to do with something in that area :-(
Without an expert eye on the code, we can't know what performance issues (global memory usage, bandwidth between motherboard and GPU, et cetera) there might be. Whilst I share the hope that it might be possible to do something GPU-wise, I would be unsurprised if an [unbiased] expert outsider decides it isn't worth it...
Cheers - Al.
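The concern about data movement can be sketched with an Amdahl's-law style estimate. This is a generic back-of-the-envelope model, not a measurement of the MilkyWay@home code: the offloadable fraction, the kernel speedup and the transfer overhead below are all made-up illustrative values, and the function name is hypothetical.

```python
# Amdahl's-law style estimate with an extra term for host<->GPU transfers.
# All numeric inputs below are hypothetical, chosen only to show the effect.

def overall_speedup(offload_fraction: float,
                    kernel_speedup: float,
                    transfer_overhead: float) -> float:
    """Whole-application speedup relative to the CPU-only run.

    offload_fraction:  share of the original runtime that can move to the GPU
    kernel_speedup:    how much faster that share runs on the GPU
    transfer_overhead: time spent moving data, as a share of the original runtime
    """
    serial_part = 1.0 - offload_fraction
    gpu_part = offload_fraction / kernel_speedup
    return 1.0 / (serial_part + gpu_part + transfer_overhead)

no_transfers = overall_speedup(0.9, 50.0, 0.0)
with_transfers = overall_speedup(0.9, 50.0, 0.2)

print(f"90% offloaded, 50x kernel, no transfer cost:  {no_transfers:.2f}x")
print(f"90% offloaded, 50x kernel, 20% transfer cost: {with_transfers:.2f}x")
```

In this model, even a kernel that is 50 times faster cannot lift the whole application past 1 / (1 - offload_fraction), and frequent transfers between GPU-worthy sections pull the result down further, which fits the caution expressed in the post.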

.clair. Joined: 3 Mar 13 Posts: 84 Credit: 779,527,712 RAC: 0	Message 75734 - Posted: 20 Jun 2023, 0:59:57 UTC One thing I would like to know is whether the N-body mt CPU app uses SSE or AVX optimization, if it is of any use for what they do. I have not seen any mention of it. It sure does speed up CPU crunching at other projects that developed app code for it.

Speedy51 Joined: 12 Jun 10 Posts: 57 Credit: 6,527,559 RAC: 397	Message 75740 - Posted: 20 Jun 2023, 3:31:59 UTC - in response to Message 75701. We can only hope someone will be kind enough to improve the OpenCL app into something much, much faster than it is right now. If somebody is able to provide me with the code and tell me where to put it I would be more than happy to do so :-)

Keith Myers Joined: 24 Jan 11 Posts: 739 Credit: 575,350,377 RAC: 123,525	Message 75741 - Posted: 20 Jun 2023, 6:33:21 UTC - in response to Message 75740. Tom provided the source code location for you already. https://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=5007&postid=75603 Have at it.

reiner Joined: 25 May 23 Posts: 13 Credit: 58,073 RAC: 0	Message 75742 - Posted: 20 Jun 2023, 9:32:44 UTC - in response to Message 75728. We can only hope someone will be kind enough to improve the OpenCL app into something much, much faster than it is right now. And someone would probably have to commit to amending the OpenCL code and/or the OpenCL-related code in the CPU-based part of the application whenever there was a relevant change to the science code in the "CPU-only" application code :-) I have no knowledge of how difficult it would be to port CPU code to GPU. My idea was to stay in the OpenCL realm: a CPU can run OpenCL code perfectly well if OpenCL drivers are installed. But I admit that it's not as straightforward as using a GPU. Since every GPU driver already has OpenCL bundled, it's a no-brainer there; with a CPU, one would have to install an OpenCL runtime for the CPU separately. I'm not sure whether a portable OpenCL CPU driver could be an option.
Regarding SSE/AVX: I have some workloads that really benefit from AVX-512 on one of my CPUs, but I am no expert on whether n-body would benefit from it. Looking at the code with my amateur mind, I see SSE2 as the bare minimum. This issue is probably also a strategic decision: the lower the minimum requirements are set, the more people can contribute; the more modern CPU/GPU features are supported, the more efficient modern gear becomes. I have seen this with inference workloads running image scalers: once the tensor cores are supported, even number-crunching beasts like the Radeon VII Pro stand no chance. (But FP64 in tensor cores is not available in consumer cards, so this example is probably not a valid one for this specific n-body case.)

reiner Joined: 25 May 23 Posts: 13 Credit: 58,073 RAC: 0	Message 75743 - Posted: 20 Jun 2023, 9:39:01 UTC - in response to Message 75740. We can only hope someone will be kind enough to improve the OpenCL app into something much, much faster than it is right now. If somebody is able to provide me with the code and tell me where to put it I would be more than happy to do so :-) Great! Thanks! If something emerges that can simply be run on Windows, I'd be happy to toss it onto some CPUs/GPUs for comparisons.