Welcome to MilkyWay@home

Work should be flowing

Message boards : News : Work should be flowing

Travis (Volunteer moderator · Project administrator · Project developer · Project tester · Project scientist)
Joined: 30 Aug 07 · Posts: 2046 · Credit: 26,480 · RAC: 0
Message 41484 - Posted: 15 Aug 2010, 22:05:35 UTC - in response to Message 41476.  

> As noted elsewhere, the workaround is to set up an automatic process which stops/starts the various processes or does a full server down/restart to clear out the problems (temporarily) -- but this is only a workaround, as it is fairly clear that there is an underlying root-cause problem which needs attention.


I'm pretty sure it's because it's the summer and RPI is doing rewiring and moving machines around, so lab staff keep shutting the MilkyWay server down.
BarryAZ
Joined: 1 Sep 08 · Posts: 520 · Credit: 302,528,262 · RAC: 263
Message 41487 - Posted: 15 Aug 2010, 22:45:17 UTC - in response to Message 41484.  
Last modified: 15 Aug 2010, 22:49:55 UTC

Travis, I have to disagree -- this problem has been going on for months now. Every two days to a week, the validator and work generator stop validating and generating. Someone then goes in and either stops/starts processes or does a full server restart, and the problem 'goes away' -- until it resurfaces. This is not a power-outage issue (though those have happened); it is an underlying process problem (a memory leak, whatever) that only gets treated symptomatically by a restart.

The situation at this instant is a demonstration of this -- the server reports 'all green', yet no work is available and the awaiting-validation population is growing rapidly as work reports in with the validator not working. This has been the case since early this morning, and this situation has repeated itself exactly this way every three days to a week for at least a couple of months -- perhaps longer.


> As noted elsewhere, the workaround is to set up an automatic process which stops/starts the various processes or does a full server down/restart to clear out the problems (temporarily) -- but this is only a workaround, as it is fairly clear that there is an underlying root-cause problem which needs attention.


> I'm pretty sure it's because it's the summer and RPI is doing rewiring and moving machines around, so lab staff keep shutting the MilkyWay server down.

Travis (Volunteer moderator · Project administrator · Project developer · Project tester · Project scientist)
Joined: 30 Aug 07 · Posts: 2046 · Credit: 26,480 · RAC: 0
Message 41489 - Posted: 15 Aug 2010, 23:21:27 UTC - in response to Message 41487.  

> Travis, I have to disagree -- this problem has been going on for months now. Every two days to a week, the validator and work generator stop validating and generating. Someone then goes in and either stops/starts processes or does a full server restart, and the problem 'goes away' -- until it resurfaces. This is not a power-outage issue (though those have happened); it is an underlying process problem (a memory leak, whatever) that only gets treated symptomatically by a restart.

> The situation at this instant is a demonstration of this -- the server reports 'all green', yet no work is available and the awaiting-validation population is growing rapidly as work reports in with the validator not working. This has been the case since early this morning, and this situation has repeated itself exactly this way every three days to a week for at least a couple of months -- perhaps longer.


There is a memory leak in the new assimilator, which I have been actively trying to debug. The problem is that it takes about 4-5 days to leak enough memory to actually crash anything.

I'm also working on a way to get the new assimilator to show up on the server status page, but that's kind of on the back burner until we have the nbody simulation stuff up and running (as it's more of a cosmetic problem).
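[Editor's note: until a slow leak like this is found, it can be contained with a watchdog that polls the daemon's resident memory and restarts it before it falls over. A minimal sketch in Python, assuming a Linux host where /proc/<pid>/status is available; the 2 GB threshold and the restart command are placeholders, not anything from the actual MilkyWay setup.]

```python
import re
import subprocess

# Placeholder values -- the real daemon, project directory, and a sane
# threshold would come from the actual server setup.
RSS_LIMIT_KB = 2 * 1024 * 1024  # restart once the process exceeds ~2 GB

def rss_kb(status_text):
    """Extract VmRSS (in kB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None

def check_and_restart(pid, restart_cmd):
    """Restart the daemon if its resident set size is over the limit.

    restart_cmd is whatever command restarts the leaking daemon
    (hypothetical here). Returns True if a restart was triggered.
    """
    with open(f"/proc/{pid}/status") as f:
        rss = rss_kb(f.read())
    if rss is not None and rss > RSS_LIMIT_KB:
        subprocess.run(restart_cmd, check=True)
        return True
    return False
```

Run from cron every few minutes, this bounds the leak's damage without touching the root cause.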
BarryAZ
Joined: 1 Sep 08 · Posts: 520 · Credit: 302,528,262 · RAC: 263
Message 41490 - Posted: 15 Aug 2010, 23:39:06 UTC - in response to Message 41489.  

OK -- so you are aware of the memory leak -- and seemingly when it goes boom, validation stops and new work generation stops. And, as you've now stated, it takes 4 to 5 days (actually it varies from about 3 to 7 days, perhaps depending on overall traffic).

So, until this is dealt with, the suggestion from the peanut gallery is to set up, as a stopgap, an automatic server stop/restart process to clear the leak (temporarily), and have it run every 3 days or so.

For now though, at this moment, the memory leak HAS surfaced again and no work has been available or validated for the past 8 to 10 hours or more.



> There is a memory leak in the new assimilator, which I have been actively trying to debug. The problem is that it takes about 4-5 days to leak enough memory to actually crash anything.

> I'm also working on a way to get the new assimilator to show up on the server status page, but that's kind of on the back burner until we have the nbody simulation stuff up and running (as it's more of a cosmetic problem).
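[Editor's note: the stop/start stopgap suggested above could be scripted along these lines -- a sketch in Python assuming a stock BOINC server layout, where <project_dir>/bin/stop and <project_dir>/bin/start control all project daemons; the project path and the 3-day interval are placeholders.]

```python
import subprocess
import time

# Assumed project location -- a stock BOINC server keeps its control
# scripts at <project_dir>/bin/stop and <project_dir>/bin/start.
PROJECT_DIR = "/home/boincadm/projects/milkyway"  # hypothetical path
RESTART_INTERVAL = 3 * 24 * 60 * 60  # every 3 days, per the suggestion

def restart_commands(project_dir):
    """Build the stop/start command pair for a BOINC project."""
    return [
        [f"{project_dir}/bin/stop"],
        [f"{project_dir}/bin/start"],
    ]

def restart_loop():
    """Stop and restart all project daemons on a fixed schedule."""
    while True:
        time.sleep(RESTART_INTERVAL)
        for cmd in restart_commands(PROJECT_DIR):
            subprocess.run(cmd, check=True)
            time.sleep(30)  # give daemons time to shut down cleanly
```

In practice this would more likely be a two-line cron job calling the same scripts; the loop just makes the idea explicit.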


Travis (Volunteer moderator · Project administrator · Project developer · Project tester · Project scientist)
Joined: 30 Aug 07 · Posts: 2046 · Credit: 26,480 · RAC: 0
Message 41491 - Posted: 15 Aug 2010, 23:41:45 UTC - in response to Message 41490.  

> OK -- so you are aware of the memory leak -- and seemingly when it goes boom, validation stops and new work generation stops. And, as you've now stated, it takes 4 to 5 days (actually it varies from about 3 to 7 days, perhaps depending on overall traffic).

> So, until this is dealt with, the suggestion from the peanut gallery is to set up, as a stopgap, an automatic server stop/restart process to clear the leak (temporarily), and have it run every 3 days or so.

> For now though, at this moment, the memory leak HAS surfaced again and no work has been available or validated for the past 8 to 10 hours or more.



Actually this was a bug from updating the assimilator code (to get it working with both the separation workunits and the nbody simulation workunits). It's fixed now, and it looks like the assimilator is happily generating work again.
BarryAZ
Joined: 1 Sep 08 · Posts: 520 · Credit: 302,528,262 · RAC: 263
Message 41495 - Posted: 16 Aug 2010, 1:07:43 UTC - in response to Message 41491.  

OK - I see that things are running again -- this is good. Now if we don't see this resurface in the next 3 to 10 days, we can all be happy campers.
Travis (Volunteer moderator · Project administrator · Project developer · Project tester · Project scientist)
Joined: 30 Aug 07 · Posts: 2046 · Credit: 26,480 · RAC: 0
Message 41496 - Posted: 16 Aug 2010, 1:10:14 UTC - in response to Message 41495.  

> OK - I see that things are running again -- this is good. Now if we don't see this resurface in the next 3 to 10 days, we can all be happy campers.


Well, at least in the next few days we should be getting the nbody simulation binaries out there, so if one assimilator crashes, hopefully the other will stay up. :)
BarryAZ
Joined: 1 Sep 08 · Posts: 520 · Credit: 302,528,262 · RAC: 263
Message 41498 - Posted: 16 Aug 2010, 6:29:21 UTC - in response to Message 41496.  
Last modified: 16 Aug 2010, 6:35:00 UTC

Oops -- not sure if this is planned or not, but as of 11:30 PM PDT there is maintenance going on -- processes are offline again.

data-driven web pages     milkyway    Running
upload/download server    milkyway    Running
scheduler                 milkyway    Running
feeder                    milkyway    Not Running
transitioner              milkyway    Not Running
milkyway_purge            milkyway    Not Running
file_deleter              milkyway    Not Running
Travis (Volunteer moderator · Project administrator · Project developer · Project tester · Project scientist)
Joined: 30 Aug 07 · Posts: 2046 · Credit: 26,480 · RAC: 0
Message 41500 - Posted: 16 Aug 2010, 8:10:43 UTC - in response to Message 41498.  

Should be back. I've been watching it while working on the nbody assimilator.


©2024 Astroinformatics Group