New Separation Runs 6/9/2021

Author	Message
paris Joined: 26 Apr 08 Posts: 87 Credit: 64,801,496 RAC: 0	Message 70921 - Posted: 26 Jun 2021, 14:46:03 UTC Same problem here. 50 invalids over the past three or four days. Plus SETI Classic = 21,082 WUs

Frank Joined: 2 Nov 10 Posts: 25 Credit: 1,894,269,109 RAC: 0	Message 70928 - Posted: 27 Jun 2021, 13:34:20 UTC - in response to Message 70858. Well, if you wanted Invalids on the gapfix_bgset3 you got them by their hundreds. Personally, I don't like them. They are mainly occurring on CPU tasks but some show up on GPU tasks.

Vester Joined: 30 Dec 14 Posts: 34 Credit: 910,198,756 RAC: 1,091	Message 70932 - Posted: 27 Jun 2021, 19:01:45 UTC I currently have 452 invalid tasks on three different computers. Invalid tasks are on AMD GPUs and CPU tasks on a Dell T-420 server with two Intel Xeon processors. I've given up trying to figure out whether I have hardware problems.

Tom Donlon Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0	Message 70933 - Posted: 27 Jun 2021, 19:11:57 UTC Ugh, it looks like the invalid problems may not be fixed after all. I'll have to look into things in the next couple days.

Tom Donlon Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0	Message 70936 - Posted: 28 Jun 2021, 21:17:43 UTC Last modified: 28 Jun 2021, 21:18:13 UTC These validation problems are happening because different types of machines & setups are returning (slightly) different values for the likelihood of each workunit. The likelihoods are only off by a few parts in a million, but luckily we don't actually need the results to be that precise. I am looking into changing the validator for the project instead of taking down the runs, so that these stripes can get crunched. That may take a couple days though, so thanks for your patience in the meantime.
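As a rough illustration of the point above (this is not the project's actual validator code), a comparison that tolerates differences of a few parts in a million might look like the sketch below. The function names `relative_difference` and `likelihoods_agree`, the sample likelihood values, and the default tolerance of 1e-5 are all assumptions made for illustration.

```python
def relative_difference(a: float, b: float) -> float:
    """Return |a - b| scaled by the larger magnitude (0.0 if both are zero)."""
    scale = max(abs(a), abs(b))
    return 0.0 if scale == 0.0 else abs(a - b) / scale


def likelihoods_agree(a: float, b: float, rel_tol: float = 1e-5) -> bool:
    """Hypothetical check: accept two likelihoods that differ by at most rel_tol."""
    return relative_difference(a, b) <= rel_tol


# Two made-up likelihoods that differ by roughly 3 parts in a million.
host_a = -3.1234567
host_b = host_a * (1 + 3e-6)

print(relative_difference(host_a, host_b))
print(likelihoods_agree(host_a, host_b))                # looser tolerance
print(likelihoods_agree(host_a, host_b, rel_tol=1e-9))  # much stricter tolerance
```

With the looser tolerance the two hosts would be treated as agreeing, while the much stricter one would flag the same pair as a mismatch.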

Keith Myers Joined: 24 Jan 11 Posts: 716 Credit: 558,805,369 RAC: 31,577	Message 70937 - Posted: 29 Jun 2021, 0:17:31 UTC - in response to Message 70936. Will relaxing the validator limits hurt the science results for future task sets?

mikey Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0	Message 70938 - Posted: 29 Jun 2021, 1:52:19 UTC - in response to Message 70937. "Will relaxing the validator limits hurt the science results for future task sets?" If so maybe they can match up the pc's doing the crunching so ie Linux pc's only validate against each other and the same with Windows pc's. With Windows being MUCH more represented in the stats page it really shouldn't matter very much.

Keith Myers Joined: 24 Jan 11 Posts: 716 Credit: 558,805,369 RAC: 31,577	Message 70939 - Posted: 29 Jun 2021, 3:28:53 UTC - in response to Message 70938. Not sure where in the BOINC server code you can configure the scheduler for that change. I don't think it's possible. Hope a dev will chime in and tell me I'm wrong.

alanb1951 Joined: 16 Mar 10 Posts: 213 Credit: 109,633,250 RAC: 1,174	Message 70940 - Posted: 29 Jun 2021, 3:32:07 UTC - in response to Message 70938. "Will relaxing the validator limits hurt the science results for future task sets? If so maybe they can match up the pc's doing the crunching so ie Linux pc's only validate against each other and the same with Windows pc's. With Windows being MUCH more represented in the stats page it really shouldn't matter very much." That might help quite a lot with CPU-to-CPU comparisons as it would [hopefully] eliminate rounding inconsistencies due to different compilers doing different optimizations and ordering of operations within complicated expressions. However, it might not help as much when a CPU job is matched against a GPU job, especially if the OpenCL is using fused multiply-add... Different GPUs have different attitudes to rounding and how close one can get to zero before a value is treated as zero - and that can be a "feature" of hardware versions from the same supplier! And I've seen tasks where a Windows CPU result wouldn't validate with a Windows GPU job, so splitting out by Operating System alone may not be enough... I suspect that if Tom relaxes the validation constraints slightly we might still see some issues, but probably nowhere near as many. Without knowing the current required precision one can only speculate - so good luck to Tom! Cheers - Al.
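The effect of operation ordering mentioned above can be seen with a tiny sketch. It only demonstrates that floating-point addition is not associative; it does not reproduce fused multiply-add or any GPU-specific behaviour, and the numbers and the 1e-9 tolerance are illustrative choices, not values from the project.

```python
import math

# The same three numbers, summed in two different orders, as two compilers
# might legitimately arrange the same expression.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

# Bit-for-bit comparison versus comparison within a small relative tolerance.
orders_match_exactly = left_to_right == right_to_left
orders_match_within_tol = math.isclose(left_to_right, right_to_left, rel_tol=1e-9)

print(left_to_right, right_to_left)
print(orders_match_exactly)
print(orders_match_within_tol)
```

The two sums differ in the last bit, so an exact comparison rejects them while a tolerance-based comparison accepts them.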

Keith Myers Joined: 24 Jan 11 Posts: 716 Credit: 558,805,369 RAC: 31,577	Message 70941 - Posted: 29 Jun 2021, 5:13:11 UTC Well, for an example of what is possible: I have not a single invalid or inconclusive task at TN-Grid, where there are 3 separate CPU applications in play on the same datasets: SSE2, AVX and FMA. I have validated against every other type of application by my wingmen. So it is possible to get the same results, or close enough to be considered valid, if you set up the validator with relaxed enough parameters.

Tom Donlon Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0	Message 70942 - Posted: 30 Jun 2021, 2:52:11 UTC Last modified: 30 Jun 2021, 2:54:46 UTC There is a way to set up BOINC so that machines will only validate against other machines of the same type, but it seems like a pain and I believe it would force us to reconfigure a large portion of the project. Definitely not something I'm trying to do. When I was digging through workunit validations, it looked like the failed validations were all working but were just off by something like +/-0.00001, probably due to inconsistencies in CPU vs GPU or CPU compiler options, etc. I am planning on turning down (up?) the tolerance so that those slightly different results are accepted. I was looking through the server code today to try to find where to do that, but didn't see it. I will keep working on this so that hopefully things will be fixed soon. I'm not sure if a lowered validation threshold could potentially cause problems with the project science in the future... I'm not immediately sure why it would, as long as the threshold isn't set to be too lenient. It could just as easily be that we are already using too stringent a validation threshold right now! Anyways, if it causes problems with the science, hopefully it would just be a matter of changing the tolerance back to the value it was before.
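To make the effect of the tolerance setting concrete, here is a minimal sketch that counts how many pairs of returned results would agree under a strict and a relaxed absolute tolerance. The helper `agreeing_pairs`, the four sample likelihoods, and both tolerance values are hypothetical; the real validator's comparison logic and thresholds are not given in this thread.

```python
from itertools import combinations


def agreeing_pairs(results, abs_tol):
    """Return every pair of results whose absolute difference is within abs_tol."""
    return [(a, b) for a, b in combinations(results, 2) if abs(a - b) <= abs_tol]


# Made-up likelihoods from four hosts, spread on the order of 0.00001.
results = [-2.98500, -2.98501, -2.98499, -2.98502]

strict_pairs = agreeing_pairs(results, abs_tol=1e-7)  # hypothetical strict setting
loose_pairs = agreeing_pairs(results, abs_tol=1e-4)   # hypothetical relaxed setting

print(len(strict_pairs), "pairs agree under the strict tolerance")
print(len(loose_pairs), "pairs agree under the relaxed tolerance")
```

Under the strict setting no two hosts agree, so nothing would validate; under the relaxed one all six pairs do, which is the kind of change being described.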

Nigel Conway Joined: 11 Feb 14 Posts: 4 Credit: 321,527 RAC: 0	Message 70943 - Posted: 30 Jun 2021, 16:38:39 UTC - in response to Message 70858. Hi! FYI, I have had validate errors for tasks 231095611 and 231095541! Nigel.

Tom Donlon Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0	Message 70944 - Posted: 1 Jul 2021, 18:03:35 UTC The good news is that I've found the tolerance setting. The bad news is that in order to change it, I have to rebuild a couple binaries. I'm going to be working on this for a little while, but I think that it may require a server restart (or at least a temporary outage of a few services). I'll keep everyone up to date with progress as I figure more out. The validation errors seem to be holding steady at <8% for the time being, which isn't ideal, but it's not a lot and it doesn't seem to be increasing very quickly.

Frank Joined: 2 Nov 10 Posts: 25 Credit: 1,894,269,109 RAC: 0	Message 70948 - Posted: 10 Jul 2021, 0:58:50 UTC - in response to Message 70944. I am seeing 11.44% Invalids and it is increasing daily. I have seen reports where four computers (2 Windows 10 and 2 Linux) tried to evaluate 4 completed tasks. 2 Tasks were judged to be Invalid, 1 Windows 10 and 1 Linux. Makes me think that the problem doesn't reside in an operating system. Keep smilin'.

AquaBoy Joined: 30 Apr 18 Posts: 4 Credit: 26,044,476 RAC: 0	Message 70951 - Posted: 10 Jul 2021, 9:02:23 UTC Maybe, better if you eventually assign credits for workunits having validation errors? The point is that our PCs spend their time to calculate results so they deserve credits anyway. As far as I know, some projects support this feature.

paris Joined: 26 Apr 08 Posts: 87 Credit: 64,801,496 RAC: 0	Message 70952 - Posted: 10 Jul 2021, 13:16:04 UTC My RAC has decreased from around 33,500 to under 25,000. I'm running about 20% invalid. I hope something can be done and that this is not going to be the "cost of doing business". Nevertheless, I enjoy running the project but I would like to see all of my work count. Plus SETI Classic = 21,082 WUs

Tom Donlon Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0	Message 70953 - Posted: 10 Jul 2021, 17:14:53 UTC I have made changes to the validator and it is currently being run on the test server. I apologize for the high rate of invalids in the meantime, but a fix is coming. I would rather make sure that this new validator works than put a broken validator up on the server.

Jimbocous Joined: 7 Mar 20 Posts: 22 Credit: 107,093,252 RAC: 9,877	Message 70957 - Posted: 11 Jul 2021, 8:21:07 UTC Last modified: 11 Jul 2021, 8:23:33 UTC Seeing about a 20% invalid rate on my running hosts as well (1 linux, 1 win10, both running CPU tasks only). Good to see in the forum here the issue's being worked. I'll take another look when I hear the change hits production. Thanks!

Cavalary Joined: 23 Aug 11 Posts: 35 Credit: 13,094,693 RAC: 20,710	Message 70959 - Posted: 13 Jul 2021, 2:16:30 UTC Sure got worried when I saw my RAC drop by some 40% and a whole bunch of validate errors. But maybe it's just this issue and not my computer...

Siran d'Vel'nahr Joined: 1 Jul 08 Posts: 88 Credit: 25,079,058 RAC: 0	Message 70960 - Posted: 13 Jul 2021, 9:15:27 UTC - in response to Message 70951. "Maybe, better if you eventually assign credits for workunits having validation errors? The point is that our PCs spend their time to calculate results so they deserve credits anyway. As far as I know, some projects support this feature." Hi AB, I'm beginning to agree with you in that our PCs do the work without errors just to have the validation system err for whatever reason and we should be compensated for the work done. I don't remember the project I was on some time ago, but they had the system set to credit us for the work done on tasks that had validation errors. My RAC here was finally back up to 121K and now it has dropped to just over 102K, soon to drop under that. :-( Have a great day! :) Siran CAPT Siran d'Vel'nahr XO - L L & P _\\// USS Vre'kasht NCC-33187 Winders 10 OS? "What a piece of junk!" - L. Skywalker "Logic is the cement of our civilization with which we ascend from chaos using reason as our guide." - T'Plana-hath