Work should be flowing
Author | Message |
---|---|
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
As noted elsewhere, the workaround is to set up an automatic process which stops and restarts the various processes, or does a full server shutdown and restart, to clear out the problems (temporarily) -- but this is only a workaround, as it is fairly clear that there is an underlying root cause which needs attention. I'm pretty sure it's because it's the summer and RPI is doing rewiring and moving machines around, so lab staff keep shutting the milkyway server down. |
Send message Joined: 1 Sep 08 Posts: 520 Credit: 302,528,262 RAC: 263 |
Travis, I have to disagree -- this problem has been going on for months now: every two days to a week, the validator and work generator stop validating and generating. Someone then goes in and either stops and restarts processes or does a full server restart, and the problem 'goes away' -- until it resurfaces. This is not a power outage issue (though those have happened); it is an underlying process problem (memory leak, whatever) that only gets treated symptomatically by a restart. This instant is a demonstration: the server has reported 'all green', yet no work is available and the awaiting-validation population is growing rapidly as work is reported in with the validator not working. This has been the case since early this morning, and the situation has been repeating itself exactly this way every three days to a week for at least a couple of months -- perhaps longer. As noted elsewhere, the workaround is to set up an automatic process which stops and restarts the various processes, or does a full server shutdown and restart, to clear out the problems (temporarily) -- but this is only a workaround, as it is fairly clear that there is an underlying root cause which needs attention. |
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
Travis, I have to disagree -- this problem has been going on for months now: every two days to a week, the validator and work generator stop validating and generating. Someone then goes in and either stops and restarts processes or does a full server restart, and the problem 'goes away' -- until it resurfaces. This is not a power outage issue (though those have happened); it is an underlying process problem (memory leak, whatever) that only gets treated symptomatically by a restart. There is a memory leak in the new assimilator, which I have been actively trying to debug. The problem is that it takes about 4-5 days for it to leak enough memory to actually crash anything. I'm also working on a way to get the new assimilator to show up on the server status page, but that's kind of on the back burner until we have the nbody simulation stuff up and running (as it's more of a cosmetic problem). |
Send message Joined: 1 Sep 08 Posts: 520 Credit: 302,528,262 RAC: 263 |
OK -- so you are aware of the memory leak -- and seemingly when it goes boom, that means validation stops and new work generation stops. And, as you now stated, it takes 4 to 5 days (actually it varies from about 3 to 7 days, perhaps depending on overall traffic). So, until this is dealt with, the suggestion from the peanut gallery is to set up, as a stopgap, an automated server stop and restart to clear the leak (temporarily), and have it run every 3 days or so. For now though, at this moment, the memory leak HAS surfaced again and no work has been available or validated for the past 8 to 10 hours or more.
|
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
OK -- so you are aware of the memory leak -- and seemingly when it goes boom, that means validation stops and new work generation stops. And, as you now stated, it takes 4 to 5 days (actually it varies from about 3 to 7 days perhaps depending on overall traffic). Actually this was a bug because of the updated assimilator code (to get it working with both the separation workunits and the nbody simulation workunits). It's fixed now and it looks like the assimilator is happily generating work again. |
Send message Joined: 1 Sep 08 Posts: 520 Credit: 302,528,262 RAC: 263 |
|
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
OK - I see that things are running again -- this is good. Now if we don't see this resurface in the next 3 to 10 days, we can all be happy campers. Well at least in the next few days we should be getting the nbody simulation binaries out there, so if one assimilator crashes hopefully the other will stay up. :) |
Send message Joined: 1 Sep 08 Posts: 520 Credit: 302,528,262 RAC: 263 |
Oops -- not sure if this is planned or not -- but there is maintenance going on (11:30 PM PDT) -- processes are offline again. Server status: data-driven web pages (milkyway): Running; upload/download server (milkyway): Running; scheduler (milkyway): Running; feeder (milkyway): Not Running; transitioner (milkyway): Not Running; milkyway_purge (milkyway): Not Running; file_deleter (milkyway): Not Running. |
Send message Joined: 30 Aug 07 Posts: 2046 Credit: 26,480 RAC: 0 |
Should be back. I've been watching it while working on the nbody assimilator. |
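
For anyone who wants to try the stopgap discussed in this thread, here is a minimal Python sketch of the decision logic only. It is an illustration under stated assumptions, not anything the project actually runs. It flags a restart either when the symptom described above shows up (no work available while the awaiting-validation count keeps growing, even though the status page is "all green") or when roughly 3 days have passed since the last restart. The `Snapshot` fields, the function names, and the thresholds are all hypothetical; how the counters are collected and how the server is actually restarted are left to the operator.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Snapshot:
    """One reading of the server counters (field names are hypothetical)."""
    time_s: float             # seconds on any fixed clock
    ready_to_send: int        # workunits currently available for download
    awaiting_validation: int  # results reported but not yet validated


def looks_stalled(history: List[Snapshot], window: int = 3) -> bool:
    """True if the last `window` readings show no work available while
    the validation backlog grows at every step."""
    if len(history) < window:
        return False
    recent = history[-window:]
    no_work = all(s.ready_to_send == 0 for s in recent)
    backlog_growing = all(
        later.awaiting_validation > earlier.awaiting_validation
        for earlier, later in zip(recent, recent[1:])
    )
    return no_work and backlog_growing


def should_restart(
    history: List[Snapshot],
    last_restart_s: float,
    max_uptime_s: float = 3 * 24 * 3600,  # assumed: "every 3 days or so"
    min_gap_s: float = 3600,              # assumed: avoid restart loops
) -> Optional[str]:
    """Return "stalled", "scheduled", or None (no restart needed)."""
    if not history:
        return None
    since_restart = history[-1].time_s - last_restart_s
    if since_restart < min_gap_s:
        return None  # restarted too recently; give the server time to recover
    if looks_stalled(history):
        return "stalled"
    if since_restart >= max_uptime_s:
        return "scheduled"
    return None


if __name__ == "__main__":
    # Made-up readings, ten minutes apart, mimicking the stall described above.
    readings = [Snapshot(0, 0, 100), Snapshot(600, 0, 180), Snapshot(1200, 0, 260)]
    print(should_restart(readings, last_restart_s=-7200))
```

The point of checking the counters rather than the process list is that, as reported above, the daemons can all show as running while nothing is being generated or validated. The minimum gap between restarts is there so a stall that a restart does not fix cannot turn into a restart loop.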
©2024 Astroinformatics Group