Reducing Workunits to Unreliable Hosts
Author | Message |
---|---|
Send message Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0 |
Hey Everyone, I just tried turning on some options to reduce workunits sent to hosts that return a significant number of errors. If you see any issues, please let me know. Thank you all for your continued support. Jake |
Send message Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
Hey Everyone, Thank you very much, hopefully the credits will flow more quickly now. |
Send message Joined: 16 Jun 08 Posts: 93 Credit: 366,882,323 RAC: 0 |
"I just tried turning on some options to reduce workunits sent to hosts that return a significant number of errors. If you see any issues, please let me know..." Unfortunately, host 643627 went through another cycle this morning of returning 80 errored tasks and getting 80 new tasks. Even though the host has a history of returning a lot of errors, does it take time for the server to "learn" that it's unreliable? |
Send message Joined: 16 Jun 08 Posts: 93 Credit: 366,882,323 RAC: 0 |
Another otherwise likely valid result of mine invalidated due to unreliable wingmen (628802 and 761112). |
Send message Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0 |
Hey Everyone, I wanted to give this a couple days to see if the server just had to learn who was unreliable. Seems that's not the case. I will do a little more research into configuring the server better throughout the week and run some more tests. Jake |
Send message Joined: 25 Jan 14 Posts: 1 Credit: 17,492,399 RAC: 0 |
One of my PCs is not getting any credit for HOST AVERAGE work done, another seems to be working OK. The OK one also posts USER AVERAGE totals which seem to include activity of BOTH of my Milkyway machines as well as what looks like correct HOST AVERAGE totals. |
Send message Joined: 16 Jun 08 Posts: 93 Credit: 366,882,323 RAC: 0 |
"One of my PCs is not getting any credit for HOST AVERAGE work done, another seems to be working OK. The OK one also posts USER AVERAGE totals which seem to include activity of BOTH of my Milkyway machines as well as what looks like correct HOST AVERAGE totals." Just so there's no confusion, what does this have to do with unreliable hosts? For your hosts, I see some user aborts and errors in N-body tasks. However, no massive and continuing computation errors like those pointed out earlier in this thread. |
Send message Joined: 16 Jun 08 Posts: 93 Credit: 366,882,323 RAC: 0 |
"I will do a little more research into configuring the server better throughout the week and run some more tests..." Any update on this, Jake? I've had another result invalidated due to unreliable wingmen. I guess it doesn't make any difference, but all the hosts I've looked at appear to be using GPUs that don't support double precision. |
Send message Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0 |
Nothing new to report yet. I've tried changing a few things on the server config side, but none of it seems to make a difference. Hopefully I will find the right knob to tweak soon. Jake |
Send message Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
"I will do a little more research into configuring the server better throughout the week and run some more tests..." Maybe they could blacklist some then? Not the actual host but the model of GPU that can't do double precision. That should mean the host can't get work. Another project I crunch for bans any GPU with less than 2 GB of RAM. I realize that may be a bit easier, but it's a start. |
Send message Joined: 28 Nov 14 Posts: 51 Credit: 86,696,721 RAC: 0 |
Hi Jake, While you're at it, can you find out what's causing the database to crash every now and then? Regards, Cliff. -- Been there Done That, still no Damn T-Shirt |
Send message Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0 |
Hey everyone, I found a small bug in the way that the scheduler tests for reliability on workunits with priority 0. I am going to try changing everything to priority 1 and see if that fixes things. Hopefully this won't change how quickly our workunits are processed or affect how it interacts with workunits from other projects. Let me know if you see any issues with this on your end. Jake |
Send message Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
Hey everyone, Thanks, I hope that helps! In the meantime, maybe you can see if this PC is affected by it: http://milkyway.cs.rpi.edu/milkyway/results.php?hostid=763376 It got 128 tasks and trashed them ALL!! |
Send message Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0 |
Says you have zero workunits in progress. That's a good sign! Jake |
Send message Joined: 2 Oct 16 Posts: 167 Credit: 1,008,062,758 RAC: 2 |
All timeouts. That could have just been a failed hard drive or a PC that was turned off. The ones that have computational errors are the problem clients. |
Send message Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0 |
Hey Everyone, I've done a little digging to see if unreliable hosts are getting fewer workunits now and that does seem to be the case. I have found several faulty computers that currently have 0 workunits in progress where they had 100+ as of last week. I am going to consider this issue resolved for now unless anyone objects. Jake |
Send message Joined: 16 Jun 08 Posts: 93 Credit: 366,882,323 RAC: 0 |
"I am going to consider this issue resolved for now unless anyone objects..." It definitely looks like we're making progress. I'd say we're good to go if 643627 doesn't get new work after it reports its latest batch sometime after 1800 UTC. |
Send message Joined: 16 Jun 08 Posts: 93 Credit: 366,882,323 RAC: 0 |
"I'd say we're good to go if 643627 doesn't get new work after it reports its latest batch sometime after 1800 UTC." Well, it's still being sent new work in spite of a 100% error rate. However, it's getting fewer and fewer new tasks so maybe it will be completely shut down in another couple of days. Good work, Jake. Thanks for chasing that down for us. |
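
For readers wondering why a host with a 100% error rate might keep receiving a shrinking trickle of work instead of being cut off at once, here is a toy model of per-host quota throttling. It is only a sketch under stated assumptions: the class `HostQuota`, the function `simulate_failing_host`, the "minus one per error, double per success" rule, and the floor of one task are all hypothetical choices for illustration, and the ceiling of 80 is simply borrowed from the 80-task batches mentioned earlier in the thread. It is not the project's actual scheduler logic or configuration.

```python
from dataclasses import dataclass


@dataclass
class HostQuota:
    """Hypothetical per-host quota tracker (illustration only, not real scheduler code)."""
    max_quota: int = 80   # assumed ceiling, borrowed from the 80-task batches in the thread
    quota: int = 80       # number of tasks the host may receive in its next batch

    def report(self, succeeded: bool) -> None:
        """Update the quota after one task result is reported."""
        if succeeded:
            # Assumed recovery rule: each good result doubles the quota, up to the ceiling.
            self.quota = min(self.max_quota, self.quota * 2)
        else:
            # Assumed penalty rule: each error removes one slot, but never below one,
            # so a repaired host still gets a chance to prove itself.
            self.quota = max(1, self.quota - 1)


def simulate_failing_host(batches: int, max_quota: int = 80) -> list[int]:
    """Return the batch sizes granted to a host that fails every task it receives."""
    host = HostQuota(max_quota=max_quota, quota=max_quota)
    sizes = []
    for _ in range(batches):
        granted = host.quota
        sizes.append(granted)
        for _ in range(granted):
            host.report(succeeded=False)
    return sizes


if __name__ == "__main__":
    print(simulate_failing_host(4))
```

In this model the floor of one task is a design choice: a host that always errors is reduced to a single task per batch rather than zero, which would match the observation of a failing host still being sent a little new work, while a host that starts returning good results climbs back quickly.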