Nbody WU Flush

Author	Message
Tom Donlon Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0	Message 73018 - Posted: 19 Apr 2022, 18:56:47 UTC The server has been brought back up. I'm going to wait for the server status page to update and then make an assessment of the situation. Hopefully things will be clearing out and I can turn the Nbody WU generator back on.

Septimus Joined: 8 Nov 11 Posts: 205 Credit: 2,903,201 RAC: 1	Message 73019 - Posted: 19 Apr 2022, 19:56:48 UTC - in response to Message 73018. That looks a lot better... Thanks

Tom Donlon Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0	Message 73020 - Posted: 19 Apr 2022, 20:04:07 UTC I was able to bring the number of Nbody WUs down to a reasonable number. I'm going to see how quickly they decrease, and then turn the Nbody WU generator on once it's reasonable to do so.

HRFMguy Joined: 12 Nov 21 Posts: 236 Credit: 575,038,236 RAC: 0	Message 73021 - Posted: 19 Apr 2022, 21:03:05 UTC - in response to Message 73012. Quoting Message 73012: "OK. The n body backlog is out of control. I think I will abort all separation _0 CPU work for a week or so. This will free up 36 CPU threads for the N Body WU Flush. Any wingman separation tasks that are _1 and above will be processed normally. All GPU separations will also be processed normally. Separation retests will also go through normally. I'm starting to get worked up into campaign mode here. Thoughts? I just got told by the server there were no Nbodys left, so perhaps Tom cleared it? I'm also getting loads of GPU separation every time I ask so I think the server is ok now and you can run what you like. I try not to do separation on the CPUs anyway, as the GPUs are many many times faster and it seems a waste. It's a pity the server options don't let me choose that completely." 716 aborted by the time I saw your post here. Could go another 30, but will hold off for a bit.

Mr P Hucker Joined: 5 Jul 11 Posts: 991 Credit: 376,793,951 RAC: 17,217	Message 73022 - Posted: 19 Apr 2022, 21:05:52 UTC - in response to Message 73020. Last modified: 19 Apr 2022, 21:06:26 UTC Quoting Message 73020: "I was able to bring the number of Nbody WUs down to a reasonable number. I'm going to see how quickly they decrease, and then turn the Nbody WU generator on once it's reasonable to do so." Don't forget to turn up the GPU separation limits like you promised me :) Pretty please? The 10 minute delay otherwise is frustrating.

Tom Donlon Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0	Message 73023 - Posted: 19 Apr 2022, 23:26:24 UTC - in response to Message 73022. Quoting Message 73022: "Don't forget to turn up the GPU separation limits like you promised me :)" All done! I'm not sure if the changes will take place instantaneously or if they will need a reset. If it needs to be reset, that will happen soon when I turn on the Nbody WU generator (probably tomorrow or Thursday, based on how the numbers are trending).

Mr P Hucker Joined: 5 Jul 11 Posts: 991 Credit: 376,793,951 RAC: 17,217	Message 73025 - Posted: 19 Apr 2022, 23:35:39 UTC - in response to Message 73023. Last modified: 19 Apr 2022, 23:37:23 UTC Quoting Message 73023: "Don't forget to turn up the GPU separation limits like you promised me :) All done! I'm not sure if the changes will take place instantaneously or if they will need a reset. If it needs to be reset, that will happen soon when I turn on the Nbody WU generator (probably tomorrow or Thursday, based on how the numbers are trending)" Thank you! I just tested it and got a limit of 900 tasks for a 3 GPU machine as before, so I guess the reset needs to happen first. My new GPUs won't be here for a few days anyway, but by then I want 6 280X to be running non stop on Milkyway! When I ask for CPU work I seem to get mainly separation.

alanb1951 Joined: 16 Mar 10 Posts: 213 Credit: 109,643,674 RAC: 1,618	Message 73026 - Posted: 20 Apr 2022, 0:13:19 UTC - in response to Message 73023. Last modified: 20 Apr 2022, 0:16:07 UTC Tom, A couple of questions now normality may have been restored :-) Is there some way of limiting the number of new work units being created to stop any further possibility of the unsent tasks getting flooded again? I mentioned the idea in the Server trouble thread, where it merely provoked a critique of BOINC software... I realize this next one isn't about N-body, but rather than posting in two places... Is there a particular reason that it seems that all Separation work units that don't validate the first result at once go on to need two retries, and all three results end up validated? It seems to have been doing this ever since the disk problem, and several folks have commented on it in various threads. I've looked at the stderr.txt of quite a few of the tasks where I'm waiting for the second retry to return, and in almost every case the values from both results seem to be well within the sort of tolerances that I thought would get passed without question (differences are out around the 12th decimal place!) Cheers - Al. P.S. here's hoping there'll be enough Separation work to allow those who don't want/need a giant data feed to get work after the big hitters have grabbed the increased numbers of tasks :-)
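To make the "differences around the 12th decimal place" observation concrete, here is a minimal Python sketch of a tolerance-based comparison between two returned results. It is only an illustration: the thread does not say how the project's validator compares results, and both the function name `results_agree` and the relative tolerance of 1e-10 are assumptions chosen for the example.

```python
import math

# Hypothetical tolerance; the real validator's threshold is not stated in the thread.
ASSUMED_REL_TOL = 1e-10

def results_agree(value_a: float, value_b: float, rel_tol: float = ASSUMED_REL_TOL) -> bool:
    """Return True when two result values match within a relative tolerance."""
    return math.isclose(value_a, value_b, rel_tol=rel_tol)

# Two made-up likelihood values that differ only around the 12th decimal place.
first = -2.987654321012
second = first + 3e-12
print(results_agree(first, second))
```

Under that assumed tolerance, a difference of this size would count as agreement, which matches the expectation described in the post.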

Tom Donlon Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0	Message 73027 - Posted: 20 Apr 2022, 0:15:08 UTC - in response to Message 73026. Last modified: 20 Apr 2022, 0:17:49 UTC Quoting Message 73026: "Is there some way of limiting the number of new work units being created to stop any further possibility of the unsent tasks getting flooded again? I mentioned the idea in the Server trouble thread, where it merely provoked a critique of BOINC software..." There is one in place, actually. It normally keeps the number of jobs ready to be sent at ~10k (like how separation is self-monitoring right now). This just got screwed up somehow with the disk problem and spiraled out of control. Quoting Message 73026: "Also, is there a particular reason that it seems that all Separation work units that don't validate the first result at once go on to need two retries, and all three results end up validated? It seems to have been doing this ever since the disk problem, and several folks have commented on it in various threads. I've looked at the stderr.txt of quite a few of the tasks where I'm waiting for the second retry to return, and in almost every case the values from both results seem to be well within the sort of tolerances that I thought would get passed without question (differences are out around the 12th decimal place!)" That is by design. We can choose how many times Separation WUs need to go out for validation, and whoever came before me thought that 2 retries was the correct number. But yes, the tolerance changes should be small because they should just depend on how different operating systems execute the code.
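The self-limiting behaviour Tom describes (keeping roughly 10k jobs ready to send) can be sketched in a few lines of Python. This is a hedged illustration, not the project's actual work generator: the function name `workunits_to_generate` and the per-cycle cap of 1,000 are hypothetical, and only the ~10k target comes from the post.

```python
def workunits_to_generate(unsent_count: int,
                          target_unsent: int = 10_000,   # ~10k, from the post
                          max_per_cycle: int = 1_000) -> int:  # hypothetical cap
    """Return how many new workunits to create in this generator cycle.

    Nothing is created while the unsent queue is at or above the target,
    so a backlog can only shrink until it drops below the target again.
    """
    deficit = target_unsent - unsent_count
    if deficit <= 0:
        return 0
    return min(deficit, max_per_cycle)

print(workunits_to_generate(9_500))    # queue slightly low: top it up
print(workunits_to_generate(250_000))  # flooded queue: create nothing
```

A check of this kind only works if the unsent count it reads is accurate, which is consistent with the throttle misbehaving after the disk problem.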

alanb1951 Joined: 16 Mar 10 Posts: 213 Credit: 109,643,674 RAC: 1,618	Message 73028 - Posted: 20 Apr 2022, 0:20:48 UTC - in response to Message 73027. Last modified: 20 Apr 2022, 0:28:56 UTC Quoting Message 73027: "Is there some way of limiting the number of new work units being created to stop any further possibility of the unsent tasks getting flooded again? I mentioned the idea in the Server trouble thread, where it merely provoked a critique of BOINC software... There is one in place, actually. It normally keeps the number of jobs ready to be sent at ~10k (like how separation is self-monitoring right now). This just got screwed up somehow with the disk problem and spiraled out of control." Thanks for the response, Tom; the only question that raises (specifically for Separation!) is whether that limit might need to be raised if the number of tasks sent per request goes up for big hitters! Just saw your edit in response to my second question too... I wondered whether it was a deliberate choice or not; it just had an unfortunate side-effect whilst retries were severely delayed, but it's less of an issue now normal service seems to have been resumed! Cheers - Al. [Edited to acknowledge added response to second question...]

Mr P Hucker Joined: 5 Jul 11 Posts: 991 Credit: 376,793,951 RAC: 17,217	Message 73029 - Posted: 20 Apr 2022, 0:24:24 UTC - in response to Message 73028. Last modified: 20 Apr 2022, 0:25:00 UTC Quoting Message 73028: "Thanks for the response, Tom; the only question that raises (specifically for Separation!) is whether that limit might need to be raised if the number of tasks sent per request goes up for big hitters!" Don't worry, as a big hitter if I don't get my huge number I will make it known. I will be watching the number of tasks received at once....

AnandBhat Joined: 14 Feb 22 Posts: 12 Credit: 2,956,525 RAC: 0	Message 73031 - Posted: 20 Apr 2022, 2:33:54 UTC - in response to Message 73020. Tom Donlon wrote: "I was able to bring the number of Nbody WUs down to a reasonable number. I'm going to see how quickly they decrease, and then turn the Nbody WU generator on once it's reasonable to do so." I noticed that some of my WUs in "Validation inconclusive" status, which had a second task created and waiting to be assigned, have had that second task cancelled ("Didn't need"), while the WU itself remains in the "Validation inconclusive" state. For example, WU 414046094. Will such work units get a new task created and assigned once the NBody WU generator is turned on so that these can be validated?

mikey Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0	Message 73034 - Posted: 20 Apr 2022, 10:20:56 UTC - in response to Message 73008. Quoting Message 73008: "I am also confused by this, my separation GPU WUs have: minimum quorum 2, initial replication 3. Why send out 3 if only 2 are needed? This seems like a waste of processing time. If two of us have agreed on the result, why get a third GPU to run it?" That setting has been around FOREVER. It exists because A LOT of people had problems with their machines and didn't complete the tasks for some reason, so the extra copy was sent out to speed up the process. I would have thought it wasn't needed today, but I guess it is.
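The trade-off mikey describes can be illustrated with a small Python sketch: with a minimum quorum of 2, sending 3 copies up front means one unreliable host does not stall the workunit. The function `time_to_quorum` and the return times are made up for illustration; they are not taken from the project's scheduler.

```python
from typing import Optional, Sequence

def time_to_quorum(return_hours: Sequence[Optional[float]],
                   min_quorum: int = 2) -> Optional[float]:
    """Hour at which enough results are back to attempt validation.

    Each entry is the hour a copy was returned, or None if that host
    never returned it. None is returned when the quorum is never met
    without sending out another copy.
    """
    finished = sorted(t for t in return_hours if t is not None)
    if len(finished) < min_quorum:
        return None
    return finished[min_quorum - 1]

# Hypothetical case: one of the hosts never returns its copy.
print(time_to_quorum([5.0, None, 30.0]))  # initial replication 3: quorum still met
print(time_to_quorum([5.0, None]))        # initial replication 2: must wait for a resend
```

The cost is the case where all three hosts do return, since the third result is then redundant work, which is the waste raised in the quoted question.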

Mr P Hucker Joined: 5 Jul 11 Posts: 991 Credit: 376,793,951 RAC: 17,217	Message 73036 - Posted: 20 Apr 2022, 10:37:43 UTC - in response to Message 73034. Quoting Message 73034: "I am also confused by this, my separation GPU WUs have: minimum quorum 2, initial replication 3. Why send out 3 if only 2 are needed? This seems like a waste of processing time. If two of us have agreed on the result, why get a third GPU to run it? That setting has been around FOREVER. It exists because A LOT of people had problems with their machines and didn't complete the tasks for some reason, so the extra copy was sent out to speed up the process. I would have thought it wasn't needed today, but I guess it is." Perhaps it would be more efficient if the deadline was shortened and the initial replication reduced to 1 or 2? So if you don't do yours, it wouldn't be long before they were resent to me.

Mr P Hucker Joined: 5 Jul 11 Posts: 991 Credit: 376,793,951 RAC: 17,217	Message 73037 - Posted: 20 Apr 2022, 10:39:05 UTC TOM: What's the status of things now? I see Nbody on server status is sitting at exactly 1000, which is correct. Is the generator back on? I'm still getting limited to 300 separation per GPU :-( Also see my reply to Mikey above about efficiency.

pawg Joined: 29 Oct 11 Posts: 1 Credit: 344,448 RAC: 0	Message 73060 - Posted: 21 Apr 2022, 9:53:45 UTC - in response to Message 73031. I have about 200 WUs with that status.

Tom Donlon Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0	Message 73106 - Posted: 25 Apr 2022, 23:37:36 UTC Hmm, I did change the values within the config.xml file on the server and then rebooted everything. Maybe I changed the wrong settings... I'll check what's up and see if anything obvious jumps out at me.

Tom Donlon Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0	Message 73108 - Posted: 25 Apr 2022, 23:51:06 UTC - in response to Message 73106. I found where to change the individual limits, but I don't know why my changes to the feeder/scheduler pool never took effect. I'll have to dig around some more because that seems funky. I can't boost the number of WUs we can send out to a single volunteer until I raise the pool numbers, because otherwise one person could drain the pool in one go and there wouldn't be any WUs left until the pool refilled.
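The pool-draining concern is simple arithmetic, sketched below in Python. The figures of 300 tasks per GPU and 900 for a 3-GPU machine come from earlier posts in this thread; the pool size of 1,000 and the raised limit of 600 per GPU are hypothetical values picked only to show the effect, and `requests_to_drain` is an illustrative helper, not a BOINC function.

```python
import math

def requests_to_drain(pool_size: int, per_gpu_limit: int, gpus: int) -> int:
    """Number of work requests from one host needed to empty the pool."""
    per_request = per_gpu_limit * gpus
    return math.ceil(pool_size / per_request)

ASSUMED_POOL_SIZE = 1_000  # hypothetical; the real pool size is not given

# Current limit: 300 per GPU, so 900 for a 3-GPU host.
print(requests_to_drain(ASSUMED_POOL_SIZE, per_gpu_limit=300, gpus=3))
# Hypothetical raised limit on a 6-GPU host: one request empties the pool.
print(requests_to_drain(ASSUMED_POOL_SIZE, per_gpu_limit=600, gpus=6))
```

With numbers like these, raising the per-volunteer limit without also raising the pool size would let a single large host take everything available, which is the situation Tom wants to avoid.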

Mr P Hucker Joined: 5 Jul 11 Posts: 991 Credit: 376,793,951 RAC: 17,217	Message 73109 - Posted: 26 Apr 2022, 0:59:03 UTC Thanks Tom. Maybe you could ask on a Boinc forum?

Robert Coplin Joined: 23 Sep 13 Posts: 19 Credit: 36,223,867 RAC: 0	Message 73110 - Posted: 26 Apr 2022, 1:11:54 UTC Why are we getting new N-Body work units when we have a lot of Validation Inconclusive work units that still need to be validated?