Nbody 1.68 release
Author | Message |
---|---|
Joined: 19 May 14 Posts: 73 Credit: 356,131 RAC: 0 |
Hi All, A new version of nbody, v1.68, has just been released. I have not yet released the Mac multi-threaded (OpenMP) version; I will release it at a later date. In this release we have added a new way of constraining the width of the stream. Previously, we used a measure of the velocity dispersion in each histogram bin. This allowed us to fit our parameters quite well. Unfortunately, we found that this may not be the best method in the long run. I have added a measure of the beta coordinate dispersion which, from initial findings, will (hopefully) make our parameters easier to fit. As always, please let me know if there are issues. Thank you all for your continuing support, Sidd |
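To make the idea of a per-bin dispersion concrete, here is a minimal sketch in Python. It is not the nbody implementation: it assumes the histogram bins run along a stream longitude coordinate (called `lambdas` here), that "dispersion" means the population standard deviation within each bin, and the function name `dispersion_per_bin` is hypothetical. The same routine could be fed either velocities or beta coordinates.

```python
import math
from collections import defaultdict


def dispersion_per_bin(lambdas, values, lo, hi, n_bins):
    """Return {bin_index: standard deviation of `values`} per histogram bin.

    Assumption: bodies are binned along `lambdas` in n_bins equal-width bins
    covering [lo, hi); `values` holds the quantity whose spread is measured
    (for example a velocity or a beta coordinate).
    """
    width = (hi - lo) / n_bins
    bins = defaultdict(list)
    for lam, val in zip(lambdas, values):
        if lo <= lam < hi:
            # Clamp to guard against floating-point rounding at the upper edge.
            idx = min(int((lam - lo) / width), n_bins - 1)
            bins[idx].append(val)

    result = {}
    for idx, vals in bins.items():
        mean = sum(vals) / len(vals)
        variance = sum((v - mean) ** 2 for v in vals) / len(vals)
        result[idx] = math.sqrt(variance)
    return result


# Hypothetical toy data: four bodies, two bins along lambda.
example = dispersion_per_bin(
    lambdas=[0.5, 0.6, 1.5, 1.7],
    values=[1.0, 3.0, 2.0, 2.0],
    lo=0.0,
    hi=2.0,
    n_bins=2,
)
```

A wider spread of beta values in a bin gives a larger number for that bin, which is the kind of quantity a fit can compare between simulation and data.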
Joined: 25 Feb 13 Posts: 580 Credit: 94,200,158 RAC: 0 |
Congrats on the new version! |
Joined: 4 Oct 11 Posts: 38 Credit: 309,729,457 RAC: 0 |
Sidd, All three of my systems running 1.68 get the following error, some more than others. <core_client_version>7.6.22</core_client_version> <![CDATA[ <message> The data is invalid. (0xd) - exit code 13 (0xd) </message> <stderr_txt> <search_application> milkyway_nbody 1.68 Windows x86_64 double OpenMP, Crlibm </search_application> Using OpenMP 4 max threads on a system with 8 processors Error evaluating NBodyCtx: [string "-- /* Copyright (c) 2016 Siddhartha Shelton..."]:81: bad argument #1 to 'create' (Missing required named argument 'BetaSigma') Failed to read input parameters file 14:52:09 (8888): called boinc_finish(13) My i7 systems get very few, but my i5 system gets a ton. Yet the tasks are completed OK by other systems on a resend. |
Joined: 19 May 14 Posts: 73 Credit: 356,131 RAC: 0 |
Thanks for letting me know!! I'm checking it out right now. |
Joined: 19 May 14 Posts: 73 Credit: 356,131 RAC: 0 |
I believe I found the workunit that error came from. Because I added an entirely new calculation, some new parameters were needed for future flexibility. Therefore, if you use the old parameter files with the new binary, it gives that error. For some reason, that workunit did exactly that: the binary being used is nbody v168, but the workunit is from the v166 runs and uses the v166 parameter files. Before releasing, I took down the older runs, so I was not expecting the work units to do this, and for that I apologize. Fortunately, this error occurs right at the beginning, before anything begins to run, so it will not waste any computational time. If you have any v166 runs in your queue, you can go ahead and cancel them so they do not give this error. |
Joined: 4 Oct 11 Posts: 38 Credit: 309,729,457 RAC: 0 |
Thanks Sidd |
Joined: 27 Sep 17 Posts: 6 Credit: 11,189,753 RAC: 0 |
Sidd, Most of my v168 runs are blowing up. Do I abort the v168 runs in queue? Thanks. |
Joined: 4 Oct 11 Posts: 38 Credit: 309,729,457 RAC: 0 |
Mossy, The problem occurs when the v168 application tries to process v166 data. You have the same issue as I do: if you look inside the stderr, the task name (de_nbody_1_13_2018_v166_20k__optimizerparameters_diff_seedruns_3_1516211024_96183_4) shows the data version. As Sidd says, it only takes 2 or 3 seconds to fail, so if there are no other ramifications (like not getting new tasks :-)), just let them run. Otherwise, you have to highlight the task in the task list in BOINC, then choose Properties to see whether the version is v166 or v168. |
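The check described above (reading the data version out of the task name) can also be sketched in a few lines of Python. This is only an illustration: the helper names `data_version` and `is_mismatch` are hypothetical, and it assumes the version tag always appears as `_vNNN_` with one major digit followed by two minor digits, as in the two task names quoted in this thread.

```python
import re

# Assumption: the tag looks like "_v166_" (major digit, then two minor digits).
VERSION_TAG = re.compile(r"_v(\d)(\d{2})_")


def data_version(task_name):
    """Return the data version encoded in a task name, e.g. '1.66', or None."""
    match = VERSION_TAG.search(task_name)
    if match is None:
        return None
    return f"{match.group(1)}.{match.group(2)}"


def is_mismatch(task_name, app_version):
    """True when the task name carries a version tag that differs from the app."""
    version = data_version(task_name)
    return version is not None and version != app_version


old_task = (
    "de_nbody_1_13_2018_v166_20k__optimizerparameters_diff_seedruns_3"
    "_1516211024_96183_4"
)
new_task = "de_nbody_4_19_2018_v168_20k__Data_1_1523906284_51420"

print(data_version(old_task), is_mismatch(old_task, "1.68"))
print(data_version(new_task), is_mismatch(new_task, "1.68"))
```

Tasks flagged as a mismatch are the ones that would fail within a few seconds with the missing 'BetaSigma' error, so they are the candidates to abort.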
Joined: 27 Sep 17 Posts: 6 Credit: 11,189,753 RAC: 0 |
Tom, Gotcha. Thanks for letting me know how to find the mis-matches in the "ready to start" state. I just aborted a few. |
Joined: 9 Feb 17 Posts: 1 Credit: 71,380 RAC: 0 |
Hey, I am new here. I am getting errors on nbody while calculating the optimizerparameter tasks with a Ryzen 1700. The CPU is Prime stable, though; I get the errors only on the nbody optimizer tasks. Everything else works fine. |
Joined: 2 Oct 16 Posts: 167 Credit: 1,007,009,348 RAC: 5,238 |
These are still being sent out. :( https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1566643057 |
Joined: 2 Oct 16 Posts: 167 Credit: 1,007,009,348 RAC: 5,238 |
And another https://milkyway.cs.rpi.edu/milkyway/result.php?resultid=2255257329 |
Joined: 27 Jan 15 Posts: 10 Credit: 1,512,630 RAC: 1 |
I keep getting the intermittent N-body task that runs and runs... the last one I aborted at 15 hours... https://milkyway.cs.rpi.edu/milkyway/workunit.php?wuid=1573440465 Sometimes if I restart BOINC, they'll run properly, but I've seen them bite the dust shortly after, too... |
Joined: 27 Jan 15 Posts: 10 Credit: 1,512,630 RAC: 1 |
Went searching for an answer to this, but couldn't find one: Why is the N-Body credit such a pittance, with double-digit credit even when the run time is the same as the regular WU? |
Joined: 13 Nov 17 Posts: 4 Credit: 3,239,591 RAC: 0 |
I had great hopes the new Nbody version would resolve the faults, but I am still getting Nbody tasks trapped. Sometimes suspending helps (so far I am at around 5 successes to 20 failures); restarting so far has not. Second thought: with our Nbody failures being a pain in the proverbial, are they responsible for some of the failed reporting? In particular, their run time can exceed their report times, which could also cause failures of single-processor tasks that were paused to let a multicore Nbody run that never completes? |
Joined: 16 Jun 08 Posts: 93 Credit: 366,882,323 RAC: 0 |
How does this happen? Stderr output <core_client_version>7.8.6</core_client_version> <![CDATA[ <message> process exited with code 13 (0xd, -243)</message> <stderr_txt> <search_application> milkyway_nbody 1.66 Darwin x86_64 double OpenMP, Crlibm </search_application> Using OpenMP 8 max threads on a system with 8 processors Application version too old. Workunit requires version 1.68, but this is 1.66 Failed to read input parameters file 04:21:59 (82996): called boinc_finish(13) </stderr_txt> ]]> |
Joined: 16 Jun 08 Posts: 93 Credit: 366,882,323 RAC: 0 |
Sidd wrote: It seems for some reason, that workunit did exactly that, the binary being used is the nbody v168 but the workunit is from the v166 runs, using the v166 parameter files. Before releasing, I took down the older runs... Maybe I misunderstand, but v166 tasks are still being sent out. |
Joined: 4 Oct 11 Posts: 38 Credit: 309,729,457 RAC: 0 |
I think we need a new version of the application that can process both v166 and v168 data file formats. PLEASE. I have only been getting v166 lately; is there a pointer to the v166 application? |
Joined: 16 Jun 08 Posts: 93 Credit: 366,882,323 RAC: 0 |
It looked good for a few days, but I've picked up some v166 tasks recently (see examples). |
Joined: 13 Nov 17 Posts: 4 Credit: 3,239,591 RAC: 0 |
Sorry for the necro, but a thought came to me. Back when we started the 'new' N-Body version, faults in N-Body processes seemed reduced! It looked like the problem of them getting stuck with no further progress was improved, while now I have seen only 3 reach completion in as many weeks, with all the others showing, say, 3 or 4 hours of processing time and a variable estimated time to completion (from 10 hours to 3 weeks) depending on where the process got stuck. A worst-case scenario was quoting something like 157 days when it got stuck at ~0.05% for a few hours (just a little beyond the deadline *cough*). A thought came to me this morning as I cleared an overnight frozen Nbody: if everyone hitting a faulty work unit passes it on, will it be passed back into the pool without examination for other machines to attempt? If a packet of data strikes such a bug, will it move up the priority chain to be solved more quickly, increasing the 'density' of faulty work units to be computed? It just seems odd that so many of these units in particular are faulting when none of the other variants hit errors (which I would hope suggests there is nothing wrong with the logical processors of this device; otherwise I'll need to look at fixing it). TLDR: -The number of Nbody work units locking up seems to be getting worse: are they being prioritized by your distribution system? -Do your systems have a way to measure or monitor how many times such work units are being passed back and forth without reaching completion? 
Side query A: -At one point your server was no longer sending Nbodies to my machine (which solves the problem well enough), presumably due to the 'high error' solution you rolled out, but now they are back and as glitchy as ever: how many processes do I need to abort to be put back on that list (if that is how this happened at all)? Side gripe B: As a personal complaint, I do wish BOINC could reduce the total CPU% being utilized; this insistence on using 100% of the processor X% of the time causes thermal spikes that can lead to thermal throttling. Poor laptop. This morning's Nbody lockup happened at 1 hr 24 minutes, at 15ish%, in de_nbody_4_19_2018_v168_20k__Data_1_1523906284_51420 |