Validator Outage
Author | Message |
---|---|
Send message Joined: 21 May 20 Posts: 3 Credit: 22,194,828 RAC: 1 |
@Tom: It seems they are still coming... https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=203048065 Is it possible to somehow cancel it on the server side? |
Send message Joined: 2 Nov 10 Posts: 25 Credit: 1,894,269,109 RAC: 0 |
These "Invalids" are really becoming irritating. I am encountering about 450 per day and each is consuming 1.625 times the normal compute time. Are we making any progress on eliminating the 7 WU tasks? I won't even bitch about the "Error While Computing" tasks. They are not Errors While Computing since no computing is ever done. Actually, they are Initialization Errors. I don't care what they are; I just want them to go away. On the Invalids, I am fairly certain that tasks that end up as 7 WU are sent as 4 WU. The assessment of 7 WUs is made by the clients' computers. Maybe what we need is a subroutine in Initialization that tests the WU count before computation starts and, if it is not 4, aborts the run. I don't like creating "workarounds" as fixes, but it would be better than what we have today. |
Send message Joined: 18 Nov 08 Posts: 291 Credit: 2,461,693,501 RAC: 0 |
These "Invalids" are really becoming irritating. I am encountering about 450 per day and each is consuming 1.625 of the normal computer time. Are we making any progress on eliminating the 7 WUs tasks? The separation source file at "milkywayathome_client/separation/separation_main.c" has the following line: "mw_printf("<number_WUs> %d </number_WUs>\n", ap.totalWUs);" Putting the following under it would abort all 7 parameter work units: if (ap.totalWUs == 7) { exit(EXIT_FAILURE); } This would avoid crunching 7 WU tasks, but you would end up with a lot of "error" results, which might cause the daily quota to be exceeded. The number of GPUs can always be faked to get more, but I suspect it is better if the project guru can fix the problem. It has been over a year since I last built the client: https://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=4551#69402 It appears there have been a few changes since then: https://github.com/Milkyway-at-home/milkywayathome_client |
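The guard proposed in this post can be sketched a little more generally, checking for the expected count of 4 rather than only rejecting 7. This is a minimal sketch, not the client's actual code: `EXPECTED_WU_COUNT`, `wu_count_is_valid` and `abort_on_bad_wu_count` are hypothetical names, and the count is passed as a plain `int` instead of being read from `ap.totalWUs`.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumption: 4 is the normal bundle size, as reported in this thread. */
#define EXPECTED_WU_COUNT 4

/* Returns true when the reported bundle size matches the expected one. */
bool wu_count_is_valid(int total_wus)
{
    return total_wus == EXPECTED_WU_COUNT;
}

/* Hypothetical helper: exit before any computation starts if the
 * reported count is not the expected value. */
void abort_on_bad_wu_count(int total_wus)
{
    if (!wu_count_is_valid(total_wus)) {
        fprintf(stderr, "Unexpected number_WUs %d (expected %d), aborting\n",
                total_wus, EXPECTED_WU_COUNT);
        exit(EXIT_FAILURE);
    }
}
```

In the client, such a check would presumably sit right after the `mw_printf` line quoted above. As the post notes, every aborted task would still be reported as an error and count against the daily quota.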
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Turns out there are a few WUs with 4x the number of parameters that got stuck in the parent population for some runs. Absolutely no idea how that happened, but I should be able to make it so that it truncates that down to the normal number when it generates new WUs. I can explain a bit more when I implement a fix, might try to get that done tonight. We'll see. Technical bits below: When the XML file (technically, it's the thing that writes to the XML file) computes the number of parameters in a job, it only considers the number of parameters in the first WU and ignores the number in the other bundled WUs. In this case, if the first WU has the wrong number (say, 4x as many parameters) then it will estimate incorrectly. This means that the number of parameters that the XML file estimates is (4x4)x26 instead of (4)x26, so you get 416 instead of the expected 104. However, when the XML generator calculates the number of WUs, it takes the total number of parameters (actually (1x4)x26 + (3x1)x26 = 182, because the other 3 bundled WUs are normal) and when you divide that by 26, you get 7. Which is why the XML file thinks that there are 7 WUs but 416 parameters, when 416 parameters should actually be 16 bundled WUs. I should figure out why these 182-parameter WUs happen, but I'll have to do that later and just try to implement a patch right now. This only became noticeable because the faulty WUs happened to return good likelihood scores (not sure if those are legitimate or not -- the first 26 parameters look realistic, but the other 26x3 are just scraped from who knows what in memory... could be that the program only looks at the first 26 when calculating the likelihood of that WU, so that run would actually be good.) Alternatively I could just try to figure out how to purge those members from the population (or just remove the extra 3x26 parameters), which should in theory stop the bad WUs from being generated. 
I know that it's later than you all would have liked, but know that I've heard your pleas and I'm trying to do something about it. |
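The arithmetic in the technical bits above can be reproduced with a few lines of C. This is only an illustration of the numbers in the post, not the server's XML generator: the function names are hypothetical, and the only values taken from the post are 26 parameters per normal WU and a bundle of 4.

```c
#include <stddef.h>

#define PARAMS_PER_WU 26  /* parameters in one normal WU, per the post */

/* Estimate as described in the post: the first WU's parameter count
 * multiplied by the number of bundled WUs. */
size_t estimated_params(const size_t *wu_params, size_t bundle_size)
{
    return bundle_size == 0 ? 0 : wu_params[0] * bundle_size;
}

/* Actual total: the sum over every bundled WU. */
size_t actual_params(const size_t *wu_params, size_t bundle_size)
{
    size_t total = 0;
    for (size_t i = 0; i < bundle_size; ++i)
        total += wu_params[i];
    return total;
}

/* WU count as derived in the post: total parameters divided by 26. */
size_t reported_wu_count(const size_t *wu_params, size_t bundle_size)
{
    return actual_params(wu_params, bundle_size) / PARAMS_PER_WU;
}
```

With a bundle of `{104, 26, 26, 26}` (one WU carrying 4x the parameters), the estimate is 416, the actual total is 182, and the reported WU count is 7. With four normal WUs the values are 104, 104 and 4.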
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
I've gone through and truncated all of the faulty population members. Nothing seems broken as far as I can tell. Over the next few days, you should hopefully see a decrease in the number of invalid runs, since all WUs generated from this point forward should have the correct size -- if not, I'll have to take some more drastic measures. Technical Bits Below: Curiously, the faulty members were all in the 0th spot in the population. I think that's because when generating the new WUs from the parents, it takes the size of the parent with the lower id number -- so, if these weird WUs aren't in the 0th place, their children won't have the wrong size, and any replacements will be guaranteed to remove that wrong size WU (although this is just a guess, could totally be wrong on this one). Also, curiously, these runs that are currently up were the only runs to have been affected by this. The server code hasn't changed at all since I ran previous runs (except for the validator, which changed after this problem began) so I'm not sure what's up with that. But the same issue was present in every single run. I'll keep an eye on things and see if the 0th spot gets updated with an oversized WU again. Hopefully this is all resolved, but maybe not. Thanks for your patience with this, I have a thousand projects all going on so it's not always easy for me to get around to this one! |
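The truncation described here could look roughly like the sketch below. It is not the project's server code: `struct member`, `truncate_oversized` and `PARAMS_PER_MEMBER` are hypothetical, and it assumes a normal population member carries 26 parameters while a faulty one carries 4x that.

```c
#include <stddef.h>

/* Assumption: a normal population member holds 26 parameters. */
#define PARAMS_PER_MEMBER 26

/* Hypothetical population member; only the parameter count matters here. */
struct member {
    size_t n_params;
};

/* Clamp every oversized member to the expected parameter count and
 * return how many members were changed. */
size_t truncate_oversized(struct member *pop, size_t n_members)
{
    size_t fixed = 0;
    for (size_t i = 0; i < n_members; ++i) {
        if (pop[i].n_params > PARAMS_PER_MEMBER) {
            pop[i].n_params = PARAMS_PER_MEMBER;
            ++fixed;
        }
    }
    return fixed;
}
```

Running it a second time on the same population changes nothing, so a pass like this could be repeated whenever an oversized member shows up again.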
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
The validation error rate is slowly falling (the number of invalid results is down 15% in the last 4 hours). Currently the server is at about a 3.2% invalid rate, down from 3.5% 4 hours ago. Hopefully this trend continues. I expect that in a few days to a week we will see things return to normal as all of the faulty WUs pass through the system. |
Send message Joined: 8 Nov 11 Posts: 205 Credit: 2,900,464 RAC: 0 |
Thanks, errors down, had 3 yesterday, 2 were due to the 7 WU problem, 1 presumably genuine. |
Send message Joined: 13 Dec 17 Posts: 46 Credit: 2,421,362,376 RAC: 0 |
Yep, invalid tasks are going down for me too. From 640 a couple of days ago down to 380 at the moment. Thanks for sorting this out and cheers! |
Send message Joined: 2 Nov 10 Posts: 25 Credit: 1,894,269,109 RAC: 0 |
Tom Donlon, you done good. In the last 24 hours I have completed 11,000 tasks and experienced 2 Invalids (7 WU) and 0 Errors While Computing. |
Send message Joined: 18 Nov 08 Posts: 291 Credit: 2,461,693,501 RAC: 0 |
Really appreciate the effort to fix this! Out of 17,000+ only 5 in the last 24 hours. |
Send message Joined: 2 Nov 10 Posts: 25 Credit: 1,894,269,109 RAC: 0 |
I thought it must be Halloween already since I was frightened. The dreaded 7 WU Invalids were back in their hundreds (actually 120s). All the corrupt tasks were sent at 6:00 UTC on October 19, 2021. Hopefully, all of the bad tasks had been buried in the reservoir of "Waiting to be sent" tasks. If so, their numbers will diminish over the next few days. Hopefully. |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Thanks for letting me know. One of the runs got an oversized WU stuck in it again. I guess it's not enough to clear them out just once; they need to be watched and cleaned out whenever this happens. I've cleared this one, hopefully it doesn't happen to the other runs. Please let me know if you start getting a lot of these errors again after a day or so. |
Send message Joined: 2 Nov 10 Posts: 25 Credit: 1,894,269,109 RAC: 0 |
7 WU tasks are back and becoming more prevalent. Over the past 24 hours I have encountered 60. There is still trouble in River City. |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Cleared. Thank you for letting me know. |
Send message Joined: 1 Dec 10 Posts: 82 Credit: 15,452,009,012 RAC: 0 |
The invalid tasks are rising again since the last problem, in my case rising from 424 to 863 in 24 hrs. The last time this went to over 7500 before coming back down. What an absolute utter waste of everything. Why is there always such an increase after the recurring problem is fixed, and then it is not, and then it is fixed, and then it is not? This is the 3rd time in as many months. ????? |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Once the problem WU enters the population, a lot of tasks are generated using that WU as a parent. All of those WUs will have the wrong number of parameters. Someone will have to look at the source code to figure out exactly what is going on, but for right now I'm satisfied with clearing problematic WUs as they come up. Apologies for any inconvenience this causes. |
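Why a single bad member can produce so many bad tasks can be illustrated with a small model. It rests on the guess made earlier in this thread, that a child takes its size from the parent with the lower id number; that rule is unconfirmed, and both function names are hypothetical.

```c
#include <stddef.h>

/* Assumed rule (a guess from earlier in the thread): a child takes its
 * parameter count from whichever parent has the lower population index. */
size_t child_param_count(const size_t *pop_sizes, size_t parent_a,
                         size_t parent_b)
{
    size_t lower = parent_a < parent_b ? parent_a : parent_b;
    return pop_sizes[lower];
}

/* Count the parent pairs (i, j), i < j, whose child would end up with a
 * parameter count different from the expected one. */
size_t count_bad_pairings(const size_t *pop_sizes, size_t n, size_t expected)
{
    size_t bad = 0;
    for (size_t i = 0; i < n; ++i)
        for (size_t j = i + 1; j < n; ++j)
            if (child_param_count(pop_sizes, i, j) != expected)
                ++bad;
    return bad;
}
```

Under this model, an oversized member in slot 0 of a four-member population spoils every pairing it takes part in (3 of 6), while the same member in the last slot spoils none. That would be consistent with the faulty members all being found in the 0th spot.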
Send message Joined: 2 Nov 10 Posts: 25 Credit: 1,894,269,109 RAC: 0 |
I don't disagree with anything you put in your last message, but I am wondering when the source code can be worked on. You have validated that the 7 WU Tasks are coming from within the Project's domain. This problem showed up in September. It had been doing well up to that point. What changed? Maybe all you have to do is reinstall the program that is causing the corruption. Or maybe a subroutine can be added to the client app to make sure that if the number of WUs is greater than 4 the task is aborted. Believe it or not, 7 WU Tasks cost the client 1.62 times the computation and power, and there are a total of three clients that are going to have to run these corrupt tasks. They are not free. We crunchers need a fix. |
Send message Joined: 1 Dec 10 Posts: 82 Credit: 15,452,009,012 RAC: 0 |
Hi Frank, You have hit the nail squarely on the head! "They are not free. We crunchers need a fix". |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
I will take a day to look at the source code again when I am recovered from my surgery and we are finished working on our NSF proposal. I am hoping that it is something that can be fixed on the server side of things, and the Separation client code doesn't need to be rebuilt. |
Send message Joined: 1 Dec 10 Posts: 82 Credit: 15,452,009,012 RAC: 0 |
Hello Tom, I am sure I speak for all in wishing you a speedy recovery from your surgery! |
©2024 Astroinformatics Group