Welcome to MilkyWay@home

Nbody 1.68 release

Message boards : News : Nbody 1.68 release

Sidd
Project developer
Project tester
Project scientist

Joined: 19 May 14
Posts: 73
Credit: 356,131
RAC: 0
Message 67017 - Posted: 31 Jan 2018, 15:51:18 UTC

Hi All,

A new version, v1.68, of nbody has just been released. I have not yet released the Mac multi-threaded (OpenMP) version; I will release it at a later date.

In this release we have added a new way of constraining the width of the stream. Previously, we used a measure of the velocity dispersion in each histogram bin. This let us fit our parameters quite well. Unfortunately, we found that this may not be the best method in the long run. I have added a measure of the beta coordinate dispersion which, from initial findings, should (hopefully) make our parameters easier to fit.
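
As a rough illustration of what a per-bin dispersion measure can look like, here is a small Python sketch. It is not the nbody code: the function name, the idea of binning along a "lambda" coordinate, and the bin edges are all assumptions made for the example. It simply takes the sample standard deviation of beta inside each bin.

```python
import math

def beta_dispersion_per_bin(lambdas, betas, lambda_min, lambda_max, n_bins):
    """Sample standard deviation of beta in each lambda bin.

    Hypothetical helper for illustration only. Bins with fewer than
    two particles get None, because a dispersion is undefined there.
    """
    width = (lambda_max - lambda_min) / n_bins
    bins = [[] for _ in range(n_bins)]
    for lam, beta in zip(lambdas, betas):
        if lambda_min <= lam < lambda_max:
            # Clamp the index to guard against floating-point rounding at the top edge.
            idx = min(int((lam - lambda_min) / width), n_bins - 1)
            bins[idx].append(beta)

    dispersions = []
    for values in bins:
        if len(values) < 2:
            dispersions.append(None)
            continue
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
        dispersions.append(math.sqrt(variance))
    return dispersions
```

A wider stream would show up as larger values in the returned list, which is the sort of quantity a fit could then compare against data.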

As always, please let me know if there are issues.

Thank you all for your continuing support,
Sidd
ID: 67017 · Rating: 0
Jake Weiss
Volunteer moderator
Project developer
Project tester
Project scientist

Joined: 25 Feb 13
Posts: 580
Credit: 94,200,158
RAC: 0
Message 67018 - Posted: 31 Jan 2018, 19:05:19 UTC

Congrats on the new version!
ID: 67018 · Rating: 0
Tom*

Joined: 4 Oct 11
Posts: 38
Credit: 309,729,457
RAC: 0
Message 67019 - Posted: 31 Jan 2018, 21:38:17 UTC
Last modified: 31 Jan 2018, 21:38:55 UTC

Sidd,

All three of my systems running 1.68 get the following error, some more than others.

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
The data is invalid.
(0xd) - exit code 13 (0xd)
</message>
<stderr_txt>
<search_application> milkyway_nbody 1.68 Windows x86_64 double OpenMP, Crlibm </search_application>
Using OpenMP 4 max threads on a system with 8 processors
Error evaluating NBodyCtx: [string "-- /* Copyright (c) 2016 Siddhartha Shelton..."]:81: bad argument #1 to 'create' (Missing required named argument 'BetaSigma')
Failed to read input parameters file
14:52:09 (8888): called boinc_finish(13)

My i7 systems get very few, but my i5 system gets a ton.

Yet the same workunits are completed OK by other systems on a resend.
ID: 67019 · Rating: 0
Sidd
Project developer
Project tester
Project scientist

Joined: 19 May 14
Posts: 73
Credit: 356,131
RAC: 0
Message 67020 - Posted: 1 Feb 2018, 0:44:44 UTC - in response to Message 67019.  

Thanks for letting me know!! I'm checking it out right now.
ID: 67020 · Rating: 0
Sidd
Project developer
Project tester
Project scientist

Joined: 19 May 14
Posts: 73
Credit: 356,131
RAC: 0
Message 67022 - Posted: 1 Feb 2018, 1:10:13 UTC - in response to Message 67019.  

I believe I found the workunit that error came from.

Because I added an entirely new calculation, some new parameters were needed for future flexibility. Therefore, if you use the old parameter files with the new binary, it gives that error. For some reason, that workunit did exactly that: the binary being used is nbody v168, but the workunit is from the v166 runs and uses the v166 parameter files. Before releasing, I took down the older runs, so I was not expecting the work units to do this, and for that I apologize.

Fortunately, this error occurs right at the beginning, before anything begins to run, so it does not waste any computational time. If you have any v166 runs in your queue, you can go ahead and cancel them so they do not give this error.
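
To make the failure mode concrete, here is a minimal Python sketch of a "required named argument" check. The real check lives in the application's handling of the Lua parameter file, so this is only an analogy: every name below is hypothetical except 'BetaSigma', which is taken from the error message quoted earlier in the thread.

```python
def missing_required(params, required):
    """Return the required names that are absent from a parameter mapping, in order."""
    return [name for name in required if name not in params]

# Hypothetical parameter sets; only 'BetaSigma' comes from the reported error.
REQUIRED_V168 = ["ExampleOldParam", "BetaSigma"]
old_style_params = {"ExampleOldParam": 1.0}                    # stands in for a v166 file
new_style_params = {"ExampleOldParam": 1.0, "BetaSigma": 2.5}  # stands in for a v168 file

def validate(params):
    """Fail immediately, before any simulation work, if a required name is missing."""
    missing = missing_required(params, REQUIRED_V168)
    if missing:
        raise ValueError("Missing required named argument(s): " + ", ".join(missing))
    return True
```

Because a check like this runs while the input is being read, an old-style file is rejected within seconds rather than after hours of computation.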
ID: 67022 · Rating: 0
Tom*

Joined: 4 Oct 11
Posts: 38
Credit: 309,729,457
RAC: 0
Message 67023 - Posted: 1 Feb 2018, 4:01:47 UTC

Thanks Sidd
ID: 67023 · Rating: 0
MossyRock

Joined: 27 Sep 17
Posts: 6
Credit: 11,189,753
RAC: 0
Message 67025 - Posted: 1 Feb 2018, 22:55:56 UTC

Sidd,

Most of my v168 runs are blowing up. Do I abort the v168 runs in queue?

Thanks.
ID: 67025 · Rating: 0
Tom*

Joined: 4 Oct 11
Posts: 38
Credit: 309,729,457
RAC: 0
Message 67026 - Posted: 2 Feb 2018, 1:48:00 UTC
Last modified: 2 Feb 2018, 1:53:07 UTC

Mossy,

The problem occurs when the v168 application tries to process v166 data.

You have the same issue as I do, if you look inside the stderr.

de_nbody_1_13_2018_v166_20k__optimizerparameters_diff_seedruns_3_1516211024_96183_4

This is the task name; the v166 in it is the data version.

As Sidd says, it only takes 2 or 3 seconds to fail, so if there are no
other ramifications (like not getting new tasks :-)), just let them run.

Otherwise you have to highlight the task in the task list in BOINC,
then choose Properties to see the version, v166 or v168.
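
If you have a list of task names to go through, the version tag can also be pulled out mechanically. The Python sketch below assumes the tag always appears as `_v<digits>_`, as it does in the task names quoted in this thread; the function names are made up for the example, and it does not talk to BOINC at all.

```python
import re

# Matches the data-version tag embedded in a task name, e.g. "_v166_".
_VERSION_TAG = re.compile(r"_v(\d+)_")

def data_version(task_name):
    """Return the data version as an int (166, 168, ...) or None if no tag is found."""
    match = _VERSION_TAG.search(task_name)
    return int(match.group(1)) if match else None

def mismatched(task_names, app_version):
    """List the tasks whose data version differs from the application version.

    Names without a recognizable tag are skipped rather than reported.
    """
    return [name for name in task_names
            if data_version(name) not in (None, app_version)]
```

For example, `mismatched(names, 168)` returns only the v166 tasks, which are the ones that fail at startup on the 1.68 application.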
ID: 67026 · Rating: 0
MossyRock

Joined: 27 Sep 17
Posts: 6
Credit: 11,189,753
RAC: 0
Message 67027 - Posted: 2 Feb 2018, 4:21:24 UTC - in response to Message 67026.  

Tom,

Gotcha.

Thanks for letting me know how to find the mis-matches in the "ready to start" state. I just aborted a few.
ID: 67027 · Rating: 0
Schwerrechner

Joined: 9 Feb 17
Posts: 1
Credit: 71,380
RAC: 0
Message 67028 - Posted: 2 Feb 2018, 12:23:56 UTC

Hey, I am new here. I am getting errors on nbody when calculating the optimizerparameters tasks with a Ryzen 1700. The CPU is prime stable, though; I get the errors only on the nbody optimizer tasks.
Everything else works fine.
ID: 67028 · Rating: 0
mmonnin

Joined: 2 Oct 16
Posts: 167
Credit: 1,007,009,348
RAC: 5,242
Message 67042 - Posted: 9 Feb 2018, 13:54:41 UTC

These are still being sent out. :(
https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1566643057
ID: 67042 · Rating: 0
mmonnin

Joined: 2 Oct 16
Posts: 167
Credit: 1,007,009,348
RAC: 5,242
Message 67047 - Posted: 9 Feb 2018, 22:44:59 UTC

And another
https://milkyway.cs.rpi.edu/milkyway/result.php?resultid=2255257329
ID: 67047 · Rating: 0
Yavanius

Joined: 27 Jan 15
Posts: 10
Credit: 1,512,630
RAC: 1
Message 67050 - Posted: 10 Feb 2018, 5:26:00 UTC - in response to Message 67017.  

I keep getting the intermittent N-body that runs and runs... the last one I aborted at 15 hours...

https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1573440465

Sometimes if I restart BOINC, they'll run properly, but I've seen them bite the dust shortly after, too...
ID: 67050 · Rating: 0
Yavanius

Joined: 27 Jan 15
Posts: 10
Credit: 1,512,630
RAC: 1
Message 67056 - Posted: 10 Feb 2018, 18:05:49 UTC - in response to Message 67017.  

Went searching for an answer to this, but couldn't find one:

Why is the N-Body credit such a pittance, only double digits, even when the run time is the same as a regular WU?
ID: 67056 · Rating: 0
Mr McGill

Joined: 13 Nov 17
Posts: 4
Credit: 3,239,591
RAC: 0
Message 67058 - Posted: 10 Feb 2018, 22:26:24 UTC

Had great hopes the new Nbody version would resolve the faults, but I am still getting Nbody tasks trapped. Sometimes suspending helps (so far I am at around 5 successes to 20 failures); restarting so far has not.

Second thought: with our Nbody failures being a pain in the proverbial, are they responsible for some of the failed reporting? In particular, their runtime can exceed their report deadline. Could that also cause failures in single-processor tasks that were paused to let a multicore Nbody run, when that Nbody never completes?
ID: 67058 · Rating: 0
ritterm

Joined: 16 Jun 08
Posts: 93
Credit: 366,882,323
RAC: 0
Message 67060 - Posted: 11 Feb 2018, 2:36:29 UTC

How does this happen?

Stderr output
<core_client_version>7.8.6</core_client_version>
<![CDATA[
<message>
process exited with code 13 (0xd, -243)</message>
<stderr_txt>
<search_application> milkyway_nbody 1.66 Darwin x86_64 double OpenMP, Crlibm </search_application>
Using OpenMP 8 max threads on a system with 8 processors
Application version too old. Workunit requires version 1.68, but this is 1.66
Failed to read input parameters file
04:21:59 (82996): called boinc_finish(13)

</stderr_txt>
]]>
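
For reference, the message in this stderr is what a minimum-version check produces when a workunit built for 1.68 lands on a 1.66 binary (here the Darwin OpenMP build; the first post notes the Mac multi-threaded 1.68 had not been released yet, which may be why). A minimal Python sketch of such a check follows. The function names are hypothetical and this is not the project's actual code; only the wording of the message is taken from the log above.

```python
def parse_version(text):
    """Turn a 'major.minor' string such as '1.68' into a comparable tuple (1, 68)."""
    major, minor = text.split(".")
    return int(major), int(minor)

def check_app_version(app_version, required_version):
    """Return an error message if the application is older than required, else None."""
    if parse_version(app_version) < parse_version(required_version):
        return ("Application version too old. Workunit requires version "
                f"{required_version}, but this is {app_version}")
    return None
```

Comparing tuples of integers rather than raw strings or floats keeps versions such as 1.9 and 1.10 in the right order.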
ID: 67060 · Rating: 0
ritterm

Joined: 16 Jun 08
Posts: 93
Credit: 366,882,323
RAC: 0
Message 67062 - Posted: 11 Feb 2018, 12:28:20 UTC - in response to Message 67022.  

Sidd wrote:
It seems for some reason, that workunit did exactly that, the binary being used is the nbody v168 but the workunit is from the v166 runs, using the v166 parameter files. Before releasing, I took down the older runs...

Maybe I misunderstand, but v166 tasks are still being sent out.
ID: 67062 · Rating: 0
Tom*

Joined: 4 Oct 11
Posts: 38
Credit: 309,729,457
RAC: 0
Message 67068 - Posted: 11 Feb 2018, 18:05:24 UTC
Last modified: 11 Feb 2018, 18:16:12 UTC

Think we need a new version of the application that can process both
v166 and v168 data file formats.

PLEASE

I have only been getting v166 lately. Is there a pointer to the v166 application?
ID: 67068 · Rating: 0
ritterm

Joined: 16 Jun 08
Posts: 93
Credit: 366,882,323
RAC: 0
Message 67093 - Posted: 16 Feb 2018, 19:41:05 UTC

It looked good for a few days, but I've picked up some v166 tasks recently (see examples).
ID: 67093 · Rating: 0
Mr McGill

Joined: 13 Nov 17
Posts: 4
Credit: 3,239,591
RAC: 0
Message 67421 - Posted: 2 May 2018, 21:22:37 UTC

Sorry for the necro, but a thought came to me. Back when we started the 'new' N-Body version, faults in N-Body processes seemed reduced! It looked like the problem of them getting stuck with no further progress was improved. Now I have seen only 3 reach completion in as many weeks; all the others accumulate, say, 3 or 4 hours of processing time with a variable estimated time to completion (from 10 hours to 3 weeks) depending on where the process got stuck. A worst case was quoting something like 157 days when it got stuck at ~0.05% for a few hours (just a little beyond the deadline *cough*).
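
For what it's worth, a figure of that size is about what a naive linear extrapolation gives. The Python sketch below is only an assumption about how such an estimate could arise; it is not a description of how BOINC actually computes its remaining-time figure, and the function name is made up.

```python
def linear_remaining_hours(elapsed_hours, fraction_done):
    """Naive estimate: assume the rest of the task progresses at the average rate so far."""
    if not 0.0 < fraction_done <= 1.0:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours * (1.0 - fraction_done) / fraction_done

# Stuck at 0.05% (a fraction of 0.0005) after 2 hours:
hours_left = linear_remaining_hours(2.0, 0.0005)  # about 3998 hours
days_left = hours_left / 24.0                     # about 166.6 days
```

So a task frozen at ~0.05% for a couple of hours lands in the same range as the 157 days quoted, and the estimate keeps growing for as long as the progress figure stays put.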

A thought came to me this morning as I cleared an overnight frozen Nbody: if everyone who hits a faulty work unit passes it on, will it be put back into the pool without examination for other machines to attempt? If a packet of data strikes such a bug, will it move up the priority chain to be solved more quickly, increasing the 'density' of faulty work units to be computed? It just seems odd that so many of these units in particular are faulting when none of the other variants hit errors (which I would hope suggests there is nothing wrong with the logical processors of this device; otherwise I'll need to look into fixing it).


TLDR:
-Number of Nbody work units locking up seems to be getting worse: are they being prioritized by your distribution system?
-Do your systems have a way to measure or monitor how many times such work units are being passed back and forth without reaching completion?

Side query A:
-At one point your server was no longer sending Nbodies to my machine (which solves the problem well enough), presumably due to the 'high error' solution you rolled out, but now they are back and as glitchy as ever: how many processes do I need to abort to be thrown back onto that list (if this is at all how it happened)?


Side gripe B:
As a personal complaint: I do wish BOINC could reduce the total CPU% being utilized. Its insistence on using 100% of the processor X% of the time causes thermal spikes that can lead to thermal throttling. Poor laptop.

This morning's Nbody lockup happened at 1 hr 24 min, at 15ish%, in de_nbody_4_19_2018_v168_20k__Data_1_1523906284_51420
ID: 67421 · Rating: 0

©2024 Astroinformatics Group