Validator Outage
Author | Message |
---|---|
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Hey Everyone, MilkyWay@home is currently experiencing an outage for the Separation Validator. I brought it back up once, and then it crashed again. I am trying to bring it back up. In the meantime, connections to the download/upload servers may stop and start intermittently as I work on the server. Thanks for your patience, I will keep you updated as things change. Tom |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
I've gotten to the root of the problem. One of the WUs with 7 sub-tasks (see https://milkyway.cs.rpi.edu/milkyway/forum_thread.php?id=4762) entered the validator, which crashed because the list of parameters is too long to fit in a MySQL string in the database. Every time I try to bring the validator up, it does actually come up, and then crashes because it runs into the same problematic WU that is waiting to be validated. I have no idea why WUs with bundles of 7 tasks are even getting sent out, but in the meantime I may be able to clear out this particular WU and get the validator running again. If there's another 7-subtask WU down the line (very possible based on conversations I've had), I expect that the validator will crash again. I think I can patch a fix into the validator to avoid these overly long parameter WUs, but I can't do that until I get back to NY, which means that the Separation validator may be down until at least tomorrow. In the meantime, if you are stuck waiting for the validator, you can always crunch N-Body WUs or temporarily switch to another project. We appreciate your volunteer time very much, and thank you for letting me know that there was a problem with the validator to begin with. |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
I cancelled the 35k jobs that had 7 bundled tasks, but the validator is still stuck on the same WU. If you run into a little lost credit (should be a fraction of a percent) it's probably because I killed those jobs. |
Send message Joined: 2 Sep 09 Posts: 4 Credit: 15,268,335 RAC: 0 |
Thank you for your dedication, it is appreciated! |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
I think that I've fixed the validator outage. The validator will now just skip over WUs with parameters that are too large to fit in the database. Unfortunately, it will have to mark them as invalid, so this may lower RAC by ~1%. The next step is figuring out why those over-bundled WUs happen in the first place and stopping that. Thanks for your patience! EDIT: The validator is returning expected results for Separation as of UTC 19:33 today. The validator has a backlog of a couple million WUs to crunch through, so you will probably see a spike in credit over a little while as those are validated. |
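The skip-instead-of-crash behaviour described in the post above can be sketched roughly as follows. This is only an illustration, not the actual validator code: the `WorkUnit` class, the `triage` function, and the 256-character limit are all hypothetical, since the thread does not state the real column size or how the validator is written.

```python
from dataclasses import dataclass

# Hypothetical limit: the thread does not say how long the database
# string column actually is.
MAX_PARAM_CHARS = 256


@dataclass
class WorkUnit:
    wu_id: int
    params: str  # parameter list serialized as text (assumed representation)


def triage(work_units, max_chars=MAX_PARAM_CHARS):
    """Split work units into those safe to validate and those to skip.

    A work unit whose parameter string would not fit in the database
    column is set aside (to be marked invalid) instead of being passed
    on, so one oversized WU cannot take the whole validator down.
    """
    to_validate, skipped = [], []
    for wu in work_units:
        if len(wu.params) > max_chars:
            skipped.append(wu)
        else:
            to_validate.append(wu)
    return to_validate, skipped
```

The design trade-off is the one Tom mentions: skipped WUs are lost as invalid, but the rest of the backlog keeps moving.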
Send message Joined: 13 Oct 16 Posts: 112 Credit: 1,174,293,644 RAC: 0 |
Thanks for the updates and quick solution! |
Send message Joined: 13 Dec 17 Posts: 46 Credit: 2,421,362,376 RAC: 0 |
Cheers for working out a solution so quickly. Nice work! |
Send message Joined: 21 May 20 Posts: 3 Credit: 22,194,788 RAC: 0 |
One difference I have found: all invalid WUs have <number_WUs> 7 </number_WUs> <number_params_per_WU> 26 </number_params_per_WU>, but validated ones have <number_WUs> 4 </number_WUs> <number_params_per_WU> 26 </number_params_per_WU>. Maybe that's the cause? |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Yeah, that's how I identified which jobs to kill. The validator does it a little differently, but it's always the jobs with no. WUs >= 7 that will invalidate. |
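For anyone who wants to check their own task output for the same pattern, a minimal sketch is below. It assumes the `<number_WUs>` tag appears in the text exactly as quoted above; the function names and the default threshold argument are made up for illustration and are not part of any MilkyWay@home tool.

```python
import re

# Matches the tag as quoted in this thread, e.g. "<number_WUs> 7 </number_WUs>".
_NUMBER_WUS = re.compile(r"<number_WUs>\s*(\d+)\s*</number_WUs>")


def bundle_size(text):
    """Return the number of bundled WUs found in the text, or None if the tag is absent."""
    match = _NUMBER_WUS.search(text)
    return int(match.group(1)) if match else None


def is_over_bundled(text, threshold=7):
    """True when the bundle size is at or above the threshold reported to invalidate."""
    size = bundle_size(text)
    return size is not None and size >= threshold
```

The threshold of 7 comes from the post above ("no. WUs >= 7"); the server-side check is described as working a little differently, so treat this only as a quick client-side filter.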
Send message Joined: 16 Mar 10 Posts: 12 Credit: 22,284,745 RAC: 0 |
I am getting a lot of Invalid tasks (see below) with significant loss of computer time:

| Task | Work unit | Computer | Sent | Time reported | Status | Run time | CPU time | Credit | Application |
|---|---|---|---|---|---|---|---|---|---|
| 341345645 | 188500118 | 705499 | 23 Sep 2021, 20:01:50 UTC | 24 Sep 2021, 16:02:57 UTC | Validate error | 7,469.61 | 7,421.86 | --- | Milkyway@home Separation v1.46 windows_x86_64 |
| 340954782 | 188147268 | 705499 | 23 Sep 2021, 12:48:50 UTC | 24 Sep 2021, 8:51:49 UTC | Validate error | 7,371.70 | 7,371.70 | --- | Milkyway@home Separation v1.46 windows_x86_64 |
| 340492122 | 189425464 | 705499 | 23 Sep 2021, 3:35:34 UTC | 23 Sep 2021, 23:53:35 UTC | Validate error | 7,478.13 | 7,459.58 | --- | Milkyway@home Separation v1.46 windows_x86_64 |
| 339950653 | 183681961 | 705499 | 22 Sep 2021, 16:35:36 UTC | 23 Sep 2021, 12:38:40 UTC | Validate error | 7,390.94 | 7,390.94 | --- | Milkyway@home Separation v1.46 windows_x86_64 |
| 339881594 | 190781418 | 705499 | 22 Sep 2021, 15:19:27 UTC | 23 Sep 2021, 11:30:45 UTC | Validate error | 7,372.12 | 7,367.42 | --- | Milkyway@home Separation v1.46 windows_x86_64 |
| 339786777 | 190693366 | 705499 | 22 Sep 2021, 13:30:24 UTC | 23 Sep 2021, 9:25:52 UTC | Validate error | 7,366.25 | 7,366.25 | --- | Milkyway@home Separation v1.46 windows_x86_64 |
| 339227791 | 190181707 | 705499 | 22 Sep 2021, 3:02:01 UTC | 22 Sep 2021, 22:57:53 UTC | Validate error | 7,684.13 | 7,490.67 | --- | Milkyway@home Separation v1.46 windows_x86_64 |
| 338697271 | 189713157 | 705499 | 21 Sep 2021, 16:59:41 UTC | 22 Sep 2021, 12:42:17 UTC | Validate error | 7,405.41 | 7,391.31 | --- | Milkyway@home Separation v1.46 windows_x86_64 |
| 337266954 | 188484311 | 705499 | 20 Sep 2021, 15:13:35 UTC | 21 Sep 2021, 11:33:58 UTC | Validate error | 7,380.12 | 7,378.81 | --- | Milkyway@home Separation v1.46 windows_x86_64 |
| 337121921 | 188350085 | 705499 | 20 Sep 2021, 12:37:19 UTC | 21 Sep 2021, 8:38:04 UTC | Validate error | 7,356.50 | 7,346.25 | --- | Milkyway@home Separation v1.46 windows_x86_64 |
| 337108544 | 188337337 | 705499 | 20 Sep 2021, 12:27:06 UTC | 21 Sep 2021, 8:06:11 UTC | Validate error | 7,347.52 | 7,347.52 | --- | Milkyway@home Separation v1.46 windows_x86_64 |
| 336975085 | 188213534 | 705499 | 20 Sep 2021, 10:01:26 UTC | 21 Sep 2021, 5:51:30 UTC | Validate error | 7,498.39 | 7,462.39 | --- | Milkyway@home Separation v1.46 windows_x86_64 |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
According to the server, these validation errors are accounting for ~5% of all jobs. That's a lot higher than I'd like. Next week I plan on spending some time digging through the WU generator code to try and fix this issue. |
Send message Joined: 16 Mar 10 Posts: 12 Credit: 22,284,745 RAC: 0 |
|
Send message Joined: 8 Nov 11 Posts: 205 Credit: 2,900,464 RAC: 0 |
I am also getting a lot of invalid jobs. Although I have a much lower RAC, I seem to be at about 10% invalid. The one common thread I notice is that they are all single-CPU tasks; those that use multiple CPUs (4 in my case) seem unaffected. |
Send message Joined: 8 Nov 11 Posts: 205 Credit: 2,900,464 RAC: 0 |
Looks to me like all WUs processed on my 32-bit machine are getting rejected as invalid. Am running 3 as a test. One just got validated, so seemingly it's not that. One just came back invalid, so a 50% failure rate so far. It will be interesting to see the last one of the three. |
Send message Joined: 8 Nov 11 Posts: 205 Credit: 2,900,464 RAC: 0 |
Looks to me like all WUs processed on my 32-bit machine are getting rejected as invalid. Am running 3 as a test. One just got validated, so seemingly it's not that. One just came back invalid, so a 50% failure rate so far. It will be interesting to see the last one of the three. The last one is also invalid, which gives a 66% failure rate. Won't be doing any more; it is wasting far too much computer time. |
Send message Joined: 9 Mar 09 Posts: 2 Credit: 10,137,164 RAC: 0 |
Is this the reason I am getting computation errors when running MilkyWay, or do I have other issues? |
Send message Joined: 24 Jan 11 Posts: 712 Credit: 551,946,567 RAC: 45,000 |
I'd say you have other issues. Weird error messages in your failed tasks. As if you were moving tasks around in the BOINC directory while they were being crunched. Not what this thread is about. |
Send message Joined: 2 Nov 10 Posts: 25 Credit: 1,894,269,109 RAC: 0 |
Keith Myers posted (message 71189) to this thread and related that the heart of the Milkyway app may be somewhat mushy and could cause both validation and "errors while computing" problems. In response to his posting he received "Not what this thread is about". We need to pay attention to what Keith says. He's only been doing this for 25 years and may be the strongest expert we have. The connection between the validation errors, the computing errors and the poor coding will probably become evident when the problems are fixed. I am not real happy about either of the problems: I encounter about 200 Invalids every day along with about 30 Computer Errors. If the 7 w/u tasks are messing up validation, how is it that we are getting more 7 w/u tasks all the time? Something must be happening between the creation of the task and the start of its execution. Could Internet noise be causing the corruption of the task? The "errors while computing" show that the executing computer has tried to run an unknown command. It seems that the "unknown command" is not due to an app problem; otherwise, all tasks would error after experiencing the first one. Did you know that Internet noise causes most transfer errors? It drives me nuts on my computers, cell phones and smart TVs (some folks say that is more like a putt than a drive). I really want these problems to be fixed, soon, so lots of luck. |
Send message Joined: 24 Jan 11 Posts: 712 Credit: 551,946,567 RAC: 45,000 |
Almost all the inconclusive validation errors are due to small differences in the computed result on different platforms, because of differences in how the CPU or GPU handles the small FFT calculations and rounding. You are most likely to validate against another host using the same platform you compute the task on. AMD against AMD. Nvidia against Nvidia. Android against Android. Intel against Intel, etc. The devs could relax the validator mechanism, but if they go too far, they let results slip past their validation threshold and the science falls down. They have stated that up to 10% is acceptable and of course would like it to be more on the order of 1%. The parameters that are passed to the application are simple text values, and they really should implement the parameter set as binary so they could do a CRC check against the file to guarantee the parameter set isn't being corrupted during internet transmission. Just had to laugh at your Internet Noise complaint, as I am currently listening to the blaring digital noise soup on the shortwave bands from all our modern digital devices leaking everything from DC to light out into the electromagnetic spectrum. I can remember listening to the shortwaves back in the 70's and it was quiet as a cave back before all the modern electronics entered our lives. Every power supply was linear and quiet as can be except maybe for some slight 60Hz hum. Now all our power supplies are switched mode and splatter crap everywhere in the bands. It is a wonder that any radio telescope can hear anything over our own din. |
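Two of the ideas in the post above, a CRC check on the parameter set and a tolerance-based comparison of results from different hosts, can be illustrated with a short sketch. Nothing here reflects how the MilkyWay@home server or validator is actually implemented: the function names and the `rel_tol` default are assumptions chosen for the example, and only standard-library calls (`zlib.crc32`, `math.isclose`) are used.

```python
import math
import zlib


def checksum(payload: bytes) -> int:
    """CRC-32 of the parameter payload, computed before it is sent."""
    return zlib.crc32(payload)


def is_intact(payload: bytes, expected_crc: int) -> bool:
    """Recompute the CRC on the receiving side and compare with the one sent along."""
    return zlib.crc32(payload) == expected_crc


def results_agree(a, b, rel_tol=1e-9):
    """Compare two hosts' results element by element within a relative tolerance.

    rel_tol is a made-up value; loosening it lets more cross-platform
    pairs validate, at the cost of accepting larger numerical drift.
    """
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b)
    )
```

A CRC only tells you the payload changed in transit; it would not explain over-bundled WUs if those are generated that way on the server.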
Send message Joined: 9 Jul 17 Posts: 100 Credit: 16,967,906 RAC: 0 |
"I can remember listening to the shortwaves back in the 70's and it was quiet as a cave back before all the modern electronics entered our lives. Every power supply was linear and quiet as can be except maybe for some slight 60Hz hum. Now all our power supplies are switched mode and splatter crap everywhere in the bands." But Radio Moscow was in the middle of 40 meters. I wouldn't recognize the place now. |
©2024 Astroinformatics Group