Nbody WU Flush
Author | Message |
---|---|
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
The server has been brought back up. I'm going to wait for the server status page to update and then make an assessment of the situation. Hopefully things will be clearing out and I can turn the Nbody WU generator back on. |
Joined: 8 Nov 11 Posts: 205 Credit: 2,903,201 RAC: 1 |
That looks a lot better... Thanks. |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
I was able to bring the number of Nbody WUs down to a reasonable number. I'm going to see how quickly they decrease, and then turn the Nbody WU generator on once it's reasonable to do so. |
Joined: 12 Nov 21 Posts: 236 Credit: 575,038,236 RAC: 0 |
OK. The N-body backlog is out of control. I think I will abort all Separation _0 CPU work for a week or so. This will free up 36 CPU threads for the N-body WU flush. Any wingman Separation tasks that are _1 and above will be processed normally. All GPU Separation tasks will also be processed normally. Separation retests will also go through normally. "I just got told by the server there were no Nbodys left, so perhaps Tom cleared it? I'm also getting loads of GPU separation every time I ask, so I think the server is OK now and you can run what you like." 716 aborted by the time I saw your post here. Could go another 30, but will hold off for a bit. |
Joined: 5 Jul 11 Posts: 991 Credit: 376,793,951 RAC: 17,217 |
Tom Donlon wrote: "I was able to bring the number of Nbody WUs down to a reasonable number. I'm going to see how quickly they decrease, and then turn the Nbody WU generator on once it's reasonable to do so." Don't forget to turn up the GPU separation limits like you promised me :) Pretty please? The 10 minute delay otherwise is frustrating. |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
"Don't forget to turn up the GPU separation limits like you promised me :)" All done! I'm not sure if the changes will take effect immediately or if they will need a reset. If a reset is needed, that will happen soon when I turn on the Nbody WU generator (probably tomorrow or Thursday, based on how the numbers are trending). |
Joined: 5 Jul 11 Posts: 991 Credit: 376,793,951 RAC: 17,217 |
"Don't forget to turn up the GPU separation limits like you promised me :)" Tom Donlon wrote: "All done! I'm not sure if the changes will take place instantaneously or if they will need a reset. If it needs to be reset, that will happen soon when I turn on the Nbody WU generator (probably tomorrow or Thursday, based on how the numbers are trending)" Thank you! I just tested it and got a limit of 900 tasks for a 3 GPU machine as before, so I guess the reset needs to happen first. My new GPUs won't be here for a few days anyway, but by then I want 6 280X to be running non-stop on Milkyway! When I ask for CPU work I seem to get mainly separation. |
Joined: 16 Mar 10 Posts: 213 Credit: 109,643,674 RAC: 1,618 |
Tom, A couple of questions now normality may have been restored :-) Is there some way of limiting the number of new work units being created to stop any further possibility of the unsent tasks getting flooded again? I mentioned the idea in the Server trouble thread, where it merely provoked a critique of BOINC software... I realize this next one isn't about N-body, but rather than posting in two places... Is there a particular reason that it seems that all Separation work units that don't validate the first result at once go on to need two retries, and all three results end up validated? It seems to have been doing this ever since the disk problem, and several folks have commented on it in various threads. I've looked at the stderr.txt of quite a few of the tasks where I'm waiting for the second retry to return, and in almost every case the values from both results seem to be well within the sort of tolerances that I thought would get passed without question (differences are out around the 12th decimal place!) Cheers - Al. P.S. here's hoping there'll be enough Separation work to allow those who don't want/need a giant data feed to get work after the big hitters have grabbed the increased numbers of tasks :-) |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
"Is there some way of limiting the number of new work units being created to stop any further possibility of the unsent tasks getting flooded again? I mentioned the idea in the Server trouble thread, where it merely provoked a critique of BOINC software..." There is one in place, actually. It normally keeps the number of jobs ready to be sent at ~10k (like how Separation is self-monitoring right now). This just got screwed up somehow with the disk problem and spiraled out of control. "Also, is there a particular reason that it seems that all Separation work units that don't validate the first result at once go on to need two retries, and all three results end up validated? It seems to have been doing this ever since the disk problem, and several folks have commented on it in various threads. I've looked at the stderr.txt of quite a few of the tasks where I'm waiting for the second retry to return, and in almost every case the values from both results seem to be well within the sort of tolerances that I thought would get passed without question (differences are out around the 12th decimal place!)" That is by design. We can choose how many times Separation WUs need to go out for validation, and whoever came before me thought that 2 retries was the correct number. But yes, the differences between results should be small, because they should just depend on how different operating systems execute the code. |
Joined: 16 Mar 10 Posts: 213 Credit: 109,643,674 RAC: 1,618 |
"Is there some way of limiting the number of new work units being created to stop any further possibility of the unsent tasks getting flooded again? I mentioned the idea in the Server trouble thread, where it merely provoked a critique of BOINC software..." Thanks for the response, Tom; the only question that raises (specifically for Separation!) is whether that limit might need to be raised if the number of tasks sent per request goes up for big hitters! Just saw your edit in response to my second question too... I wondered whether it was a deliberate choice or not; it just had an unfortunate side-effect whilst retries were severely delayed, but it's less of an issue now that normal service seems to have been resumed! Cheers - Al. [Edited to acknowledge added response to second question...] |
Joined: 5 Jul 11 Posts: 991 Credit: 376,793,951 RAC: 17,217 |
"Thanks for the response, Tom; the only question that raises (specifically for Separation!) is whether that limit might need to be raised if the number of tasks sent per request goes up for big hitters!" Don't worry, as a big hitter, if I don't get my huge number I will make it known. I will be watching the number of tasks received at once... |
Joined: 14 Feb 22 Posts: 12 Credit: 2,956,525 RAC: 0 |
Tom Donlon wrote: "I was able to bring the number of Nbody WUs down to a reasonable number. I'm going to see how quickly they decrease, and then turn the Nbody WU generator on once it's reasonable to do so." I noticed that some of my WUs in "Validation inconclusive" status, which had a second task created and waiting to be assigned, have had that second task cancelled ("Didn't need"), while the WU itself remains in the "Validation inconclusive" state. For example, WU 414046094. Will such work units get a new task created and assigned once the Nbody WU generator is turned on, so that they can be validated? |
Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
I am also confused by this, my separation GPU WUs have: That setting has been around FOREVER and is because A LOT of people had problems with their machines and didn't complete the tasks for some reason, so to speed up the process they did that. I would have thought it wasn't needed today, but I guess it is. |
Joined: 5 Jul 11 Posts: 991 Credit: 376,793,951 RAC: 17,217 |
"I am also confused by this, my separation GPU WUs have:" Perhaps it would be more efficient if the deadline were shortened and the replication reduced to 1 or 2? So if you don't do yours, it wouldn't be long before they were resent to me. |
Joined: 5 Jul 11 Posts: 991 Credit: 376,793,951 RAC: 17,217 |
TOM: What's the status of things now? I see Nbody on server status is sitting at exactly 1000, which is correct. Is the generator back on? I'm still getting limited to 300 separation per GPU :-( Also see my reply to Mikey above about efficiency. |
Joined: 29 Oct 11 Posts: 1 Credit: 344,448 RAC: 0 |
I have about 200 WUs with that status. |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Hmm, I did change the values within the config.xml file on the server and then rebooted everything. Maybe I changed the wrong settings... I'll check what's up and see if anything obvious jumps out at me. |
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
I found where to change the individual limits, but I don't know why my changes to the feeder/scheduler pool never took effect. I'll have to dig around some more because that seems funky. I can't boost the number of WUs we can send out to a single volunteer until I raise the pool numbers, because otherwise one person could drain the pool in one go and there wouldn't be any WUs left until the pool refilled. |
Joined: 5 Jul 11 Posts: 991 Credit: 376,793,951 RAC: 17,217 |
Thanks Tom. Maybe you could ask on a BOINC forum? |
Joined: 23 Sep 13 Posts: 19 Credit: 36,223,867 RAC: 0 |
Why are we getting new N-body work units when we have a lot of "Validation inconclusive" work units that still need to be validated? |
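
For anyone following along, here is a minimal sketch of two mechanisms Tom describes in this thread: a work generator that tops the unsent queue back up to about 10k jobs, and the reason the per-host limit cannot safely be raised before the feeder/scheduler pool is. This is illustrative Python only, not BOINC or MilkyWay@home server code. The function names and the pool size of 1000 are assumptions made up for the example; the ~10k target and the 900-task limit are the figures quoted in the posts above.

```python
def jobs_to_generate(unsent: int, target: int = 10_000) -> int:
    """Return how many new jobs to create so the unsent queue reaches `target`.

    Hypothetical helper: when the queue is already at or above the target
    (as during the backlog), nothing new is generated.
    """
    return max(0, target - unsent)


def drains_pool(per_request_limit: int, pool_size: int) -> bool:
    """Return True if a single request could empty the whole pool.

    Hypothetical helper: `pool_size` stands for the number of tasks the
    server holds ready to hand out at once.
    """
    return per_request_limit >= pool_size


if __name__ == "__main__":
    # Queue slightly below target: generate only the shortfall.
    print(jobs_to_generate(9_500))
    # Flooded queue: the generator stays idle until the backlog clears.
    print(jobs_to_generate(250_000))
    # 900 tasks per request against an assumed pool of 1000 leaves some work.
    print(drains_pool(900, 1000))
    # Doubling the limit without growing the pool lets one host empty it.
    print(drains_pool(1800, 1000))
```

The second check is why the pool numbers have to go up first: with the limit at or above the pool size, one request leaves nothing for anyone else until the pool refills.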
©2025 Astroinformatics Group