Welcome to MilkyWay@home

Validator Outage

muca
Joined: 21 May 20
Posts: 3
Credit: 22,194,788
RAC: 0
Message 71223 - Posted: 8 Oct 2021, 8:30:06 UTC - in response to Message 71148.  
Last modified: 8 Oct 2021, 8:43:32 UTC

@Tom: It seems they are still coming...
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=203048065
Is it possible to cancel it somehow on the server side?
ID: 71223

Frank
Joined: 2 Nov 10
Posts: 25
Credit: 1,894,269,109
RAC: 0
Message 71243 - Posted: 13 Oct 2021, 16:01:06 UTC

These "Invalids" are really becoming irritating. I am encountering about 450 per day and each is consuming 1.625 times the normal compute time. Are we making any progress on eliminating the 7 WU tasks?
I won't even bitch about the "Error While Computing" Tasks. They are not Errors While Computing since no computing is ever done. Actually, they are Initialization Errors. I don't care what they are; I just want them to go away.
On the Invalids, I am fairly certain that tasks that end up as 7 WU are sent as 4 WU. The assessment of 7 WUs is made by the clients' computers. Maybe what we need is a subroutine in Initialization that tests the WU Count before computation starts and, if it is not 4, aborts the run. I don't like creating "workarounds" as fixes but it would be better than what we have today.
ID: 71243

Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 71244 - Posted: 13 Oct 2021, 18:15:34 UTC - in response to Message 71243.  

These "Invalids" are really becoming irritating. I am encountering about 450 per day and each is consuming 1.625 times the normal compute time. Are we making any progress on eliminating the 7 WU tasks?
I won't even bitch about the "Error While Computing" Tasks. They are not Errors While Computing since no computing is ever done. Actually, they are Initialization Errors. I don't care what they are; I just want them to go away.
On the Invalids, I am fairly certain that tasks that end up as 7 WU are sent as 4 WU. The assessment of 7 WUs is made by the clients' computers. Maybe what we need is a subroutine in Initialization that tests the WU Count before computation starts and, if it is not 4, aborts the run. I don't like creating "workarounds" as fixes but it would be better than what we have today.


The separation source file at "milkywayathome_client/separation/separation_main.c" has the following line:

mw_printf("<number_WUs> %d </number_WUs>\n", ap.totalWUs);

Putting the following under it would abort all tasks that report 7 WUs:

if (ap.totalWUs == 7)
{
    exit(EXIT_FAILURE);  /* stop before any computation starts */
}

This would avoid crunching 7 WU tasks, but you would end up with a lot of "error" results, which might cause the daily quota to be exceeded. The number of GPUs can always be faked to get more, but I suspect it is better if the project guru can fix the problem.
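
A slightly more general version of the same check, as a sketch only: compare against the expected bundle size instead of hard-coding 7. The names `EXPECTED_WUS` and `wu_count_is_valid` are hypothetical and do not come from the client source, and the value 4 is simply the normal bundle size mentioned in this thread.

```c
#include <stdbool.h>

/* Assumption: a normal bundled task carries 4 WUs (figure quoted in this thread). */
#define EXPECTED_WUS 4

/* Hypothetical helper, not part of separation_main.c: returns true only when
 * the reported WU count matches the expected bundle size. */
bool wu_count_is_valid(int total_wus)
{
    return total_wus == EXPECTED_WUS;
}
```

In the client this could be used as `if (!wu_count_is_valid(ap.totalWUs)) { exit(EXIT_FAILURE); }`, placed right after the `mw_printf` line. The same caveat about "error" results and the daily quota would still apply.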

It has been over a year since I last built the client:
https://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=4551#69402

It appears there have been a few changes since then.
https://github.com/Milkyway-at-home/milkywayathome_client
ID: 71244

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71245 - Posted: 13 Oct 2021, 20:28:51 UTC

Turns out there are a few WUs with 4x the number of parameters that got stuck in the parent population for some runs. Absolutely no idea how that happened, but I should be able to make it so that it truncates that down to the normal number when it generates new WUs.

I can explain a bit more when I implement a fix, might try to get that done tonight. We'll see.

Technical bits below:

When the XML file (technically, it's the thing that writes to the XML file) computes the number of parameters in a job, it only considers the number of parameters in the first WU and ignores the number in the other bundles. In this case, if the first WU has the wrong number (say, 4x as many parameters) then it will estimate incorrectly. This means that the number of parameters that the XML file estimates is (4x4)x26 instead of (4)x26, so you get 416 instead of the expected 104.

However, when the XML generator calculates the number of WUs, it takes the total number of parameters (actually (1x4)x26 + (3x1)x26 = 182, because the other 3 bundled WUs are normal) and when you divide that by 26, you get 7. Which is why the XML file thinks that there are 7 WUs but 416 parameters, when 416 parameters should actually be 16 bundled WUs.
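
The arithmetic in the two paragraphs above can be checked with a short C sketch. The figure of 26 parameters per WU and the bundle size of 4 come from this post; the helper names are hypothetical and are not taken from the server code.

```c
#include <assert.h>

/* Parameters in one normal WU, as quoted in this post. */
enum { PARAMS_PER_WU = 26 };

/* Hypothetical helper: the parameter estimate that inspects only the first WU
 * in a bundle and assumes every other WU in the bundle has the same size. */
int estimated_params(int first_wu_params, int bundle_size)
{
    assert(first_wu_params > 0 && bundle_size > 0);
    return first_wu_params * bundle_size;
}

/* Hypothetical helper: the WU count derived from the true parameter total. */
int derived_wu_count(int total_params)
{
    assert(total_params > 0 && total_params % PARAMS_PER_WU == 0);
    return total_params / PARAMS_PER_WU;
}
```

For a bundle whose first WU is 4x oversized, `estimated_params(4 * 26, 4)` gives 416, while the true total is 4x26 + 3x26 = 182 and `derived_wu_count(182)` gives 7. For a healthy bundle, `estimated_params(26, 4)` gives 104 and `derived_wu_count(104)` gives 4.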

I should figure out why these 182-parameter WUs happen, but I'll have to do that later and just try to implement a patch right now. This only became noticeable because the faulty WUs happened to return good likelihood scores (not sure if those are legitimate or not -- the first 26 parameters look realistic, but the other 26x3 are just scraped from who knows what in memory... could be that the program only looks at the first 26 when calculating the likelihood of that WU, so that run would actually be good.) Alternatively I could just try to figure out how to purge those members from the population (or just remove the extra 3x26 parameters), which should in theory stop the bad WUs from being generated.

I know that it's later than you all would have liked, but know that I've heard your pleas and I'm trying to do something about it.
ID: 71245

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71246 - Posted: 13 Oct 2021, 21:31:55 UTC
Last modified: 14 Oct 2021, 3:40:19 UTC

I've gone through and truncated all of the faulty population members. Nothing seems broken as far as I can tell. Over the next few days, you should hopefully see a decrease in the number of invalid runs, since all WUs generated from this point forward should have the correct size -- if not, I'll have to take some more drastic measures.
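
As a rough illustration of what truncating a faulty population member means, here is a minimal C sketch. The `member_t` layout and both function names are assumptions made for this example; the real server stores its population differently and this is not its code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory view of one population member. */
typedef struct {
    double *params;   /* parameter values */
    size_t  n_params; /* number of valid entries in params */
} member_t;

/* Cut an oversized member down to the expected length by keeping only the
 * first `expected` values. Returns 1 if it was truncated, 0 if untouched. */
int truncate_member(member_t *m, size_t expected)
{
    assert(m != NULL);
    if (m->n_params > expected) {
        m->n_params = expected;
        return 1;
    }
    return 0;
}

/* Apply the truncation to a whole population; returns how many were fixed. */
size_t truncate_population(member_t *pop, size_t n_members, size_t expected)
{
    size_t fixed = 0;
    for (size_t i = 0; i < n_members; i++)
        fixed += (size_t)truncate_member(&pop[i], expected);
    return fixed;
}
```

With one 104-parameter member and one 26-parameter member, `truncate_population(pop, 2, 26)` reports one fix and leaves both members at 26 parameters.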

Technical Bits Below:

Curiously, the faulty members were all in the 0th spot in the population. I think that's because when generating the new WUs from the parents, it takes the size of the parent with lower id number -- so, if these weird WUs aren't in the 0th place, their children won't have the wrong size, and any replacements will be guaranteed to remove that wrong size WU (although this is just a guess, could totally be wrong on this one). Also, curiously, these runs that are currently up were the only runs to have been affected by this. The server code hasn't changed at all since I ran previous runs (except for the validator, which changed after this problem began) so I'm not sure what's up with that. But the same issue was present in every single run.
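
To make the guess above concrete, here is a small C sketch of the assumed sizing rule. It models only the guess that a child takes its size from the lower-numbered parent; the function name and signature are hypothetical and this has not been checked against the server source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical rule, modelling the guess in this post: a child WU takes the
 * parameter count of whichever parent has the lower population index. */
size_t child_param_count(const size_t *member_sizes, size_t n_members,
                         size_t parent_a, size_t parent_b)
{
    assert(member_sizes != NULL);
    assert(parent_a < n_members && parent_b < n_members);
    size_t lower = parent_a < parent_b ? parent_a : parent_b;
    return member_sizes[lower];
}
```

Under this rule, with a 104-parameter member in slot 0 and 26-parameter members elsewhere, any pairing that includes slot 0 produces a 104-parameter child, because index 0 is always the lower one.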

I'll keep an eye on things and see if the 0th spot gets updated with an oversized WU again. Hopefully this is all resolved, but maybe not.

Thanks for your patience with this, I have a thousand projects all going on so it's not always easy for me to get around to this one!
ID: 71246

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71247 - Posted: 14 Oct 2021, 17:50:15 UTC
Last modified: 14 Oct 2021, 17:50:53 UTC

The validation error rate is slowly falling (the number of invalid results is down 15% over the last 4 hours). The server is currently at an invalid rate of about 3.2%, down from 3.5% 4 hours ago. Hopefully this trend continues.

I expect that within a few days to a week things will return to normal as all of the faulty WUs pass through the system.
ID: 71247

Septimus
Joined: 8 Nov 11
Posts: 205
Credit: 2,900,464
RAC: 0
Message 71250 - Posted: 15 Oct 2021, 6:52:10 UTC - in response to Message 71247.  

Thanks, errors down, had 3 yesterday, 2 were due to the 7 WU problem, 1 presumably genuine.
ID: 71250

Max_Pirx
Joined: 13 Dec 17
Posts: 46
Credit: 2,421,362,376
RAC: 0
Message 71251 - Posted: 15 Oct 2021, 17:33:58 UTC

Yep, invalid tasks are going down for me too. From 640 a couple of days ago down to 380 at the moment. Thanks for sorting this out and cheers!
ID: 71251

Frank
Joined: 2 Nov 10
Posts: 25
Credit: 1,894,269,109
RAC: 0
Message 71252 - Posted: 17 Oct 2021, 14:35:12 UTC

Tom Donlon, you done good.
In the last 24 hours I have completed 11,000 tasks and experienced 2 Invalids (7 WU) and 0 Errors While Computing.
ID: 71252

Joseph Stateson
Joined: 18 Nov 08
Posts: 291
Credit: 2,461,693,501
RAC: 0
Message 71253 - Posted: 17 Oct 2021, 14:57:01 UTC

Really appreciate the effort to fix this!
Out of 17,000+ only 5 in last 24 hours.
ID: 71253

Frank
Joined: 2 Nov 10
Posts: 25
Credit: 1,894,269,109
RAC: 0
Message 71254 - Posted: 19 Oct 2021, 15:06:40 UTC

I thought it must be Halloween already since I was frightened. The dreaded 7 WU Invalids were back in their hundreds (actually, the 120s). All the corrupt tasks were sent at 6:00 UTC on October 19, 2021.
Hopefully, all of the bad tasks had been buried in the reservoir of "Waiting to be sent" tasks. If so, their numbers will diminish over the next few days. Hopefully.
ID: 71254

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71256 - Posted: 19 Oct 2021, 22:31:19 UTC

Thanks for letting me know. One of the runs got an oversized WU stuck in it again. I guess it's not enough just to clear them out once; they need to be watched and cleaned out whenever this happens.

I've cleared this one, hopefully it doesn't happen to the other runs. Please let me know if you start getting a lot of these errors again after a day or so.
ID: 71256

Frank
Joined: 2 Nov 10
Posts: 25
Credit: 1,894,269,109
RAC: 0
Message 71283 - Posted: 28 Oct 2021, 14:56:36 UTC

7 WU tasks are back and becoming more prevalent. Over the past 24 hours I have encountered 60. There is still trouble in River City.
ID: 71283

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71284 - Posted: 28 Oct 2021, 17:38:04 UTC

Cleared. Thank you for letting me know.
ID: 71284

Spatzthecat
Joined: 1 Dec 10
Posts: 82
Credit: 15,452,009,012
RAC: 0
Message 71285 - Posted: 29 Oct 2021, 2:10:54 UTC

The invalid tasks are rising again since the last problem, in my case rising from 424 to 863 in 24 hrs. The last time this went to over 7500 before coming back down. What an absolute utter waste of everything.

Why is there always such an increase after this recurring problem, which is fixed, then not, then fixed, then not? This is the 3rd time in as many months.

?????
ID: 71285

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71286 - Posted: 29 Oct 2021, 14:05:19 UTC

Once the problem WU enters the population, a lot of tasks are generated using that WU as a parent. All of those WUs will have the wrong number of parameters.

Someone will have to look at the source code to figure out exactly what is going on, but for right now I'm satisfied with clearing problematic WUs as they come up. Apologies for any inconvenience this causes.
ID: 71286

Frank
Joined: 2 Nov 10
Posts: 25
Credit: 1,894,269,109
RAC: 0
Message 71288 - Posted: 30 Oct 2021, 2:53:00 UTC - in response to Message 71286.  

I don't disagree with anything you put in your last message, but I am wondering when the source code can be worked on.
You have validated how the 7 WU Tasks are coming from within the Project's domain. This problem showed up in September. It had been doing well up to that point. What changed?
Maybe all you have to do is reinstall the program that is causing the corruption. Or, maybe a subroutine can be added to the client app to make sure that if the number of WUs is greater than 4 the task is aborted.
Believe it or not, 7 WU Tasks cost the client 1.62 times the computation and power, and there are a total of three clients that are going to have to run these corrupt tasks. They are not free. We crunchers need a fix.
ID: 71288

Spatzthecat
Joined: 1 Dec 10
Posts: 82
Credit: 15,452,009,012
RAC: 0
Message 71290 - Posted: 30 Oct 2021, 22:37:40 UTC - in response to Message 71288.  

Hi Frank,
You have hit the nail squarely on the head!
"They are not free. We crunchers need a fix".
ID: 71290

Tom Donlon
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 10 Apr 19
Posts: 408
Credit: 120,203,200
RAC: 0
Message 71292 - Posted: 1 Nov 2021, 2:35:47 UTC

I will take a day to look at the source code again when I am recovered from my surgery and we are finished working on our NSF proposal. I am hoping that it is something that can be fixed on the server side of things, and the Separation client code doesn't need to be rebuilt.
ID: 71292

Spatzthecat
Joined: 1 Dec 10
Posts: 82
Credit: 15,452,009,012
RAC: 0
Message 71294 - Posted: 1 Nov 2021, 9:47:39 UTC - in response to Message 71292.  

Hello Tom,
I am sure I speak for all in wishing you a speedy recovery from your surgery!
ID: 71294

©2024 Astroinformatics Group