Validator Outage
Author | Message |
---|---|
Send message Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 1 |
I will take a day to look at the source code again when I am recovered from my surgery and we are finished working on our NSF proposal. I am hoping that it is something that can be fixed on the server side of things, and the Separation client code doesn't need to be rebuilt. I too hope you can recover quickly from your surgery!! Maybe during that time you can reach out to the other BOINC admins at the other projects for some quick advice on where to look, i.e. the Separation side or the BOINC side of things. Maybe they can help with the setting requiring a 10 minute back-off when getting new GPU tasks as well. I'm hoping you can work this in between all the 'other' work stuff you will be doing as well. Good luck on your surgery!!! |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Thank you everyone! I just got my wisdom teeth removed this morning so I'm not sure how much I will be working for a little while. Nothing serious! Unfortunately I have so many projects that this one often gets pushed to low priority, which I definitely understand is frustrating for you. I'd like to be able to do everything quickly, but it's not usually possible. I'll do my best to squeeze this in soon, though. Thanks for being understanding. |
Send message Joined: 26 Apr 08 Posts: 87 Credit: 64,801,496 RAC: 0 |
Thank you, Tom Donlon, for all that you do. It is appreciated. Plus SETI Classic = 21,082 WUs |
Send message Joined: 1 Dec 10 Posts: 82 Credit: 15,452,009,012 RAC: 0 |
Invalid and errored tasks are on the increase again: 560 invalid in 24 hrs, 480 error. This really needs to be sorted, Tom. |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Cleared out the bad WU. Apologies. |
Send message Joined: 1 Dec 10 Posts: 82 Credit: 15,452,009,012 RAC: 0 |
Thank you Tom. I hope you are feeling much better. |
Send message Joined: 2 Nov 10 Posts: 25 Credit: 1,894,269,109 RAC: 0 |
Tom, We need you to unstick the software again. Today I encountered about 60 7-WU tasks. They were all sent today and run today. On average they spent 6 hours in my computers. So they are fresh, indicating to me that you have a stuck one. And no, I haven't forgotten my war against Errors While Computing. Did you know that Rosetta at Home is experiencing a bunch of flawed tasks, even as we speak? Milkyway and Rosetta are both BOINC users. Makes one wonder whether the malfeasant software might live in BOINC servers. I would like to see the end of Errors While Computing; they imply my computers erred. It ain't true. |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
I took a look and there aren't any stuck WUs. They might just be leftover from the last stuck one. It could totally be that there's a bug somewhere in the BOINC code, but it's also just as reasonable that the issue is with the custom Milkyway code. Good to know that we're all struggling together with these things. |
Send message Joined: 1 Dec 10 Posts: 82 Credit: 15,452,009,012 RAC: 0 |
Hello Tom, 1367 Errors |
Send message Joined: 16 Mar 10 Posts: 211 Credit: 108,179,677 RAC: 5,067 |
Hello Tom, As a user who'd like to help if he can, may I ask whether those are marked as Error or Invalid (or a mixture)? And are they Separation or NBody?? I can't tell because your computers are currently hidden... The issue Tom is trying to deal with in this thread is only Separation tasks being marked Invalid because of malformed tasks being sent out. If you're suffering errors rather than being hit by huge numbers of those tasks that will go Invalid, the community might be able to help if they can see what sort of errors you are getting; and given that there aren't lots of other people sailing in reporting bad tasks going Invalid at the moment, there's cause to wonder... Cheers - Al. P.S. I have monitoring set up on my systems to spot the "7 work unit" tasks as they arrive (so I can choose to abort them before they start) - I haven't seen one in several days (and would be quick to report any I saw that weren't late retries!...) |
Send message Joined: 1 Dec 10 Posts: 82 Credit: 15,452,009,012 RAC: 0 |
Hi Al, All of my hosts crunch Separation units. My current level of wasted units: Invalid 209, Error 1355. I have experienced over 7500 Invalid recently, but this is quite high for Errors. Cheers |
Send message Joined: 16 Mar 10 Posts: 211 Credit: 108,179,677 RAC: 5,067 |
Hi Al, O.K. -- thanks for clarifying... Without knowing how many systems are involved, and how many of those Errors are tasks not returned in time (I think MW logs both User Aborted and Server Aborted as Error...), it's difficult to comment further on Errors. If some of them are genuine errors (e.g. GPU code compilation errors, device errors, invalid memory accesses...) whatever is causing those might also be contributing to the Invalid count by returning results that actually are Invalid! At risk of telling you something you already know (in which case my apologies!), there are currently two reasons a task can be flagged Invalid. The case that can (and presumably, eventually, will) be solved is the tasks that seem to contain 7 sub-tasks instead of 4 - Tom has arranged that those are automatically flagged Invalid -- there's potential bad science in there! The other case is because as the results converge(?) on a final solution some of the calculations may be dealing with very small values and different bits of hardware will produce subtly different results due to rounding errors, truncations to zero, and so forth. If such issues happen early enough in a particular sub-task the results from different hardware (or different compilers) can diverge by enough to introduce uncertainty as to which results to select as "canonical" -- in such cases, someone is going to end up with Invalid results... If you've got a consistent source of Error flags that isn't due to time-outs, you could always start a thread about it in Number Crunching, and someone will certainly pitch in to help... Happy crunching - Al. |
Send message Joined: 16 Mar 10 Posts: 211 Credit: 108,179,677 RAC: 5,067 |
Tom, There seem to be some more 7-WU tasks out there; I've just aborted tasks from the workunits listed below; all of them were initially created within the last 24 hours... Workunit 234295593 -- de_modfit_84_bundle4_4s_south4s_gapfix_1636154231_1109369 Workunit 234328098 -- de_modfit_85_bundle4_4s_south4s_gapfix_bgset2_1636154231_1140632 Workunit 234395293 -- de_modfit_85_bundle4_4s_south4s_gapfix_bgset2_1636154231_1204374 Workunit 235146445 -- de_modfit_84_bundle4_4s_south4s_gapfix_1636154231_1915344 Fortunately, these constitute less than 3% of the total tasks I've received so far today! [Edit - two more!...] Workunit 235070057 -- de_modfit_85_bundle4_4s_south4s_gapfix_bgset2_1636154231_1842523 Workunit 235140512 -- de_modfit_84_bundle4_4s_south4s_gapfix_1636154231_1909654 Cheers - Al. |
Send message Joined: 1 Dec 10 Posts: 82 Credit: 15,452,009,012 RAC: 0 |
Invalid 760, Errors 1640 |
Send message Joined: 1 Dec 10 Posts: 82 Credit: 15,452,009,012 RAC: 0 |
Hello Tom, Things are just getting worse. Invalid 1122, Errors 1717 |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Hi All, I was away from my computers yesterday and wasn't able to fix things until now. Cleared out two WUs. Also, I've noticed some users getting hostile about this issue in the message boards and my DMs. I would like to remind you that Separation is a small part of my job, and it is not the focus of my PhD thesis. Unfortunately I don't always have the time to drop everything and dig through Separation code. Right now, all of my time is going into proposals to keep myself and MilkyWay@home funded. When things calm down, I will take the time and find a solution. I understand that many of you have a financial interest in crunching WUs, and that the longer that I take to fix the problem the less money that you make. Apologies for the delays. |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
I went through and purged all unsent jobs from the DB that had 7 WUs. Hopefully this decreases your validate error count rapidly instead of just waiting for things to work themselves out. |
Send message Joined: 16 Mar 10 Posts: 211 Credit: 108,179,677 RAC: 5,067 |
Tom, Sorry to be the bearer of bad news yet again... I've just cleared out the three tasks below, all for new work units: Workunit 239227782 name de_modfit_84_bundle4_4s_south4s_gapfix_1636388182_2840024 created 11 Nov 2021, 8:25:19 UTC; Task 409067936 Created 11 Nov 2021, 17:39:20 UTC, Sent 11 Nov 2021, 17:49:36 UTC. Workunit 239471045 name de_modfit_84_bundle4_4s_south4s_gapfix_1636388182_3069694 created 11 Nov 2021, 13:29:40 UTC; Task 409067359 Created 11 Nov 2021, 17:38:27 UTC, Sent 11 Nov 2021, 17:49:36 UTC. Workunit 239471509 name de_modfit_84_bundle4_4s_south4s_gapfix_1636388182_3070149 created 11 Nov 2021, 13:30:03 UTC; Task 409067360 Created 11 Nov 2021, 17:38:27 UTC, Sent 11 Nov 2021, 17:49:36 UTC. Is there anything any of us can do to help you resolve this (other than just drawing your attention to them, hopefully early enough to let you stop them quickly!)? I (for one) would be willing to look at code, but I'm on the wrong side of the Atlantic to just drop in :-) Cheers - Al. |
Send message Joined: 2 Nov 10 Posts: 25 Credit: 1,894,269,109 RAC: 0 |
Tom, The 7 WUs are back. Back consuming 1.62 times the energy required to run the normal 4 WU tasks and wasting 1.62 times the computing time. It is a serious problem. I know you will "unstick" the hung 7 WU task and all will be well for a couple of days. That isn't a fix; it is a workaround. And it is not fair to you; you have to spend a bunch of time mucking around in the software to keep the system running, again and again. What must we do to get a solution to this problem? It has to be fixed. We can't trivialize this problem as an inconvenience we can tolerate. Some users might tolerate it but many will not (including me). Save the Milkyway!! Is that cosmic or what? |
Send message Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
Cleared out 2 bad WUs. I may have an idea for a patchy fix that could be implemented quickly. Might take a look today and try it if I get around to it. If I do, the server will go up and down a few times later. |
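
A footnote on the rounding point made earlier in the thread (that different hardware or compilers can produce subtly different results when very small values are involved): the minimal Python sketch below is not taken from the Separation code, and the values in it are hypothetical, chosen only for illustration. It shows the underlying effect, which is that adding the same double-precision numbers in a different order can give a different answer.

```python
import math

# Hypothetical values for illustration only: one large term and ten tiny ones.
values = [1.0] + [1e-16] * 10

# Large term first: each tiny term is below half the spacing between
# adjacent doubles near 1.0, so every addition rounds back to 1.0.
forward = 0.0
for v in values:
    forward += v

# Tiny terms first: they accumulate among themselves before meeting 1.0,
# so their combined contribution survives the final addition.
backward = 0.0
for v in reversed(values):
    backward += v

# math.fsum tracks the lost low-order bits and returns a correctly rounded sum.
exact = math.fsum(values)

print("forward :", repr(forward))
print("backward:", repr(backward))
print("fsum    :", repr(exact))
print("orders agree:", forward == backward)
```

If two hosts effectively accumulate terms in different orders, small discrepancies of this kind are one plausible way for their results to drift apart enough that a validator comparing them within a tolerance rejects one of them.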
©2024 Astroinformatics Group