Welcome to MilkyWay@home

Nbody 1.04

Richard Haselgrove
Joined: 4 Sep 12
Posts: 219
Credit: 456,474
RAC: 0
Message 56933 - Posted: 17 Jan 2013, 20:01:47 UTC - in response to Message 56927.  

We inoculated the rest of the 30677n batch against disk scribbling, but unfortunately they all turned out to be as short as estimated. Got to do some payback to other projects for a while, but we'll fetch some more later today and inoculate them on receipt.

I don't know if this will be any help or not, but I had a CHISQ fault out on MDUE which was what I was expecting since all the wingmen had as well.

The really interesting thing I discovered (but should have realized earlier if I had thought more about it) is I had the slot directory open on the desktop to make it easier to pop in and take a look from time to time, but then wandered off for a bit and didn't close Windows Explorer. When I came back the task had already faulted out and been reported, so I didn't see it happen.

However, since the slot folder had been open on the desktop, Windows didn't allow BOINC to delete the files in it when it tried to clean up the mess.

Therefore, I still have a copy of all the files for the task which I would think should be current right up to the point just before it failed.

Thoughts?

Sorry, Lady Watson and I were busy helping Bernd catch a slippery (non-stick) file over at Einstein - missed this one.

As you probably know by now, I don't think this will have worked. XP doesn't automatically refresh the view displayed in an explorer window when the underlying disk status changes: I suspect you will just have had a stale view of something that is no longer there. IIRC, Vista and Win7 are better at auto-refresh - with XP, you have to keep pressing F5 (refresh) to see if the file size changes. OTOH, that gives you the chance to use the 'flicker' test - if something does change, it's more likely the corner of your eye will catch it.
ID: 56933 · Rating: 0
Richard Haselgrove
Joined: 4 Sep 12
Posts: 219
Credit: 456,474
RAC: 0
Message 56934 - Posted: 17 Jan 2013, 20:06:11 UTC - in response to Message 56931.  

The one you just posted about (we call it a '196') is a Maximum Disk Usage Exceeded (MDUE) one. Older BOINC CC's may refer to it as a '177'. IOW, a different error code, but basically the same thing.

The older clients actually call it a '-177'. As well as extending the number range, David moved some of the cases from below the axis (negative 'error' - reserved mainly for crashes and program coding errors) to above the axis (positive 'exit status' - more to do with scheduling and operational issues).
ID: 56934 · Rating: 0
Richard Haselgrove
Joined: 4 Sep 12
Posts: 219
Credit: 456,474
RAC: 0
Message 56935 - Posted: 17 Jan 2013, 20:40:43 UTC

Meanwhile, here's a genuine new exit status for us: 197 (Maximum elapsed time exceeded) for task 383317771. That would have been an error -177 under an old client too - resource limit exceeded - but the different status number makes it clear that we're talking about a different resource.

Fortunately, the good Doctor managed to capture some clinical samples:

The WU was issued with

<rsc_fpops_est>232567000.000000</rsc_fpops_est>
<rsc_fpops_bound>2325670000000.000000</rsc_fpops_bound>

The final report was

<final_cpu_time>25.911770</final_cpu_time>
<final_elapsed_time>26.156496</final_elapsed_time>
<exit_status>197</exit_status>

And I'm using (in app_info)

<flops>91244303218.731995</flops>

Dividing out the numbers (the units are compatible), that gives an initial runtime estimate of 2.55 milliseconds, and a maximum allowed time of 25.5 seconds. Even with this project's incredibly generous limit of allowing the task to run for 10,000 times longer than its initial estimate (the BOINC default, used on most projects I've checked it on, is 10x), the bound was too low, and BOINC killed the task on schedule.

The flops value I'm using was derived from the APR the server had calculated from the first 14 completed tasks under anonymous platform: that APR has now dropped to 29.123541063248 (* 10^9), so obviously I have some editing to do. If my value had matched the value the server used to calculate <rsc_fpops_bound>, I should have been allowed 79.85 seconds, and I might have got away with it (as my wingmate did).
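
The division described above can be reproduced with a short Python sketch. It uses only the figures quoted in this post; the variable and function names are illustrative, and treating the runtime limit as simply rsc_fpops_bound divided by the flops value is the working assumption of the post, not a statement of how the BOINC client is coded.

```python
# Values copied from the post above.
rsc_fpops_est = 232567000.0          # <rsc_fpops_est>
rsc_fpops_bound = 2325670000000.0    # <rsc_fpops_bound>
app_info_flops = 91244303218.731995  # <flops> from app_info
server_apr_flops = 29.123541063248e9 # current APR reported by the server
final_elapsed_time = 26.156496       # <final_elapsed_time>


def seconds_for(fpops, flops):
    """Seconds needed to perform `fpops` operations at `flops` per second."""
    return fpops / flops


# Initial runtime estimate and kill limit using the app_info flops value.
estimate_s = seconds_for(rsc_fpops_est, app_info_flops)
limit_s = seconds_for(rsc_fpops_bound, app_info_flops)

# Limit that would have applied if the flops value had matched the server APR.
server_limit_s = seconds_for(rsc_fpops_bound, server_apr_flops)

# How many times longer than the estimate the task is allowed to run.
bound_ratio = rsc_fpops_bound / rsc_fpops_est

print(f"initial estimate:      {estimate_s * 1000:.2f} ms")
print(f"limit (app_info):      {limit_s:.2f} s")
print(f"limit (server APR):    {server_limit_s:.2f} s")
print(f"bound / estimate:      {bound_ratio:.0f}x")
print(f"elapsed exceeds limit: {final_elapsed_time > limit_s}")
```

With these inputs the elapsed time of 26.16 seconds lands just over the limit derived from the app_info value, and well under the limit derived from the server's APR.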

Unfortunately, the edit will have to wait. I'm now running de_nbody_100K_104_1_1356215205_100871_1 (anyone else seen a 100K?): the remaining time of 65h:30 is clearly an underestimate, with the task barely past 4.2% progress after 5 hours elapsed time. Tuesday morning, you reckon?
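
For what it's worth, the "Tuesday morning" guess can be checked with a rough linear extrapolation. This is only a sketch: it assumes progress stays proportional to elapsed time, takes "barely past 4.2%" as exactly 4.2%, and uses the timestamp of this post as the starting point. The names are illustrative.

```python
from datetime import datetime, timedelta

# Figures from the post above.
elapsed_hours = 5.0            # elapsed time so far
progress = 0.042               # fraction done ("barely past 4.2%")
boinc_remaining_hours = 65.5   # remaining time shown by BOINC (65h:30)
posted_at = datetime(2013, 1, 17, 20, 40, 43)  # post timestamp, UTC


def extrapolate_total_hours(elapsed, fraction_done):
    """Linear extrapolation: total runtime if progress continues at the same rate."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed / fraction_done


total_hours = extrapolate_total_hours(elapsed_hours, progress)
remaining_hours = total_hours - elapsed_hours
projected_finish = posted_at + timedelta(hours=remaining_hours)

print(f"projected total:     {total_hours:.1f} h")
print(f"projected remaining: {remaining_hours:.1f} h (BOINC shows {boinc_remaining_hours} h)")
print(f"projected finish:    {projected_finish:%Y-%m-%d %H:%M} UTC")
```

On these assumptions the projection falls on the Tuesday, a little after midday rather than in the morning. A later post in this thread reports 532,816 seconds of run time for the task (about 148 hours), so the linear guess was itself on the low side.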
ID: 56935 · Rating: 0
Alinator
Joined: 7 Jun 08
Posts: 464
Credit: 56,639,936
RAC: 0
Message 56938 - Posted: 17 Jan 2013, 22:23:31 UTC - in response to Message 56933.  
Last modified: 17 Jan 2013, 22:35:06 UTC

Sorry, Lady Watson and I were busy helping Bernd catch a slippery (non-stick) file over at Einstein - missed this one.

As you probably know by now, I don't think this will have worked. XP doesn't automatically refresh the view displayed in an explorer window when the underlying disk status changes: I suspect you will just have had a stale view of something that is no longer there. IIRC, Vista and Win7 are better at auto-refresh - with XP, you have to keep pressing F5 (refresh) to see if the file size changes. OTOH, that gives you the chance to use the 'flicker' test - if something does change, it's more likely the corner of your eye will catch it.


Yep, I know about the problem with view refreshes on XP-32, but XPP-64 is actually based on early 'Longhorn', and thus is better in that regard.

Either way, my main point is that, because Windows wouldn't allow BOINC to clean up the slot folder, I have actual copies of everything that was there just prior to the fault.

I was wondering if you thought they might be useful from a forensics POV. The Project Team might be able to make use of them.

Also, roger that on the new error discovery, and yes, the one currently running on the test rig is a 100K (8:55:00 elapsed and running). I noted though that mine isn't 'fresh': it's a resend of one that timed out at its deadline, originally issued on Jan 5th.
ID: 56938 · Rating: 0
Swedis
Joined: 5 Nov 12
Posts: 3
Credit: 6,378,981
RAC: 0
Message 56943 - Posted: 18 Jan 2013, 1:53:08 UTC - in response to Message 56928.  


That's hard to say for sure. You have to look at it on a case by case basis.

However, if you have taken the precautions mentioned here, and it does complete, and you are considered 'reliable' by the validator already, or get the right wingmen....

The payoff can be pretty good on the long ones for being a guinea pig. ;-)

Your call. :-D



I will stick with crunching everything that's in the queue :D
Thanks for the reply!
ID: 56943 · Rating: 0
Treq
Joined: 10 Oct 12
Posts: 1
Credit: 2,900,244
RAC: 0
Message 56972 - Posted: 21 Jan 2013, 9:00:03 UTC

As a datapoint for you devs:
On my laptop, an Intel i7 with Win7 64-bit, N-Body 1.04 always seems to fail within a second of starting.

http://imgur.com/jPQroFj
ID: 56972 · Rating: 0
Richard Haselgrove
Joined: 4 Sep 12
Posts: 219
Credit: 456,474
RAC: 0
Message 57009 - Posted: 24 Jan 2013, 10:44:44 UTC

Here's another datapoint: task de_nbody_100K_104_1_1356215205_100871_1 finished after 532,816 seconds run time, 1,516,005 seconds CPU time. [That's under my 3-thread MT plan class, via anonymous platform]

The task was interrupted by a totally unrelated BSOD on the machine, but recovered successfully from checkpoint and completed without suffering a 374 heap corruption error. That makes me wonder if different task types follow (and hence checkpoint) different processing paths, and the memory reload problem only applies to certain cases? If so, the fact that this one says "Number of particles in bins is very small compared to total. (1 << 100000). Skipping distance calculation" might be a clue.

Too late to do any research on this run, but if 374 re-appears next time, we could try and look for processing information like this in wingmate reports for tasks which crash on exit (presumably before this final line is added to std_err).

Talking of wingmates, I wonder how long it'll be before two more come along with 2½ weeks of fast CPU time to validate this task?
ID: 57009 · Rating: 0
Interstel
Joined: 8 Aug 12
Posts: 9
Credit: 156,273
RAC: 0
Message 59131 - Posted: 26 Jun 2013, 23:53:28 UTC - in response to Message 56847.  

Is it ktm32.dll or ktmw32.dll? The error message says the first but your message board link has the second.

James

Joined MilkyWay@Home in 2012
Online since ARPANET days
First activity on Honeywell 1648
Series Mainframe in 1975 at age 12.
ID: 59131 · Rating: 0
Richard Haselgrove
Joined: 4 Sep 12
Posts: 219
Credit: 456,474
RAC: 0
Message 59132 - Posted: 27 Jun 2013, 8:32:33 UTC - in response to Message 59131.  

Is it ktm32.dll or ktmw32.dll? The error message says the first but your message board link has the second.

James

ktmw32.dll is a Microsoft Windows support file.

ktm32.dll is a typing error.
ID: 59132 · Rating: 0

©2024 Astroinformatics Group