Message boards : News : News General
Author | Message |
---|---|
Joined: 12 Nov 21 Posts: 236 Credit: 575,038,236 RAC: 0 |
I am sure you all remember the w/u=7 tasks that caused invalidations. Well, this morning I encountered the son of w/u=7. Its signature is w/u=5 and it causes validation errors. Que paso? Not sure I got it, but to paraphrase Curly of the Three Stooges, "I'm tryin' to think, but nothin's happening!" |
Joined: 16 Mar 10 Posts: 213 Credit: 108,992,769 RAC: 29,470 |
@Frank
Further to the post a couple of places above... One of the two Invalid results mentioned now has three wingmen, all of whom validated successfully! Fortunately, one of them was another Windows CPU task, which allowed an easy comparison of the results (result structure being the same).
Ignoring the bit about Lua script errors (which everyone gets!), the successful task had this for the first "job"...
    Switching to Parameter File 'astronomy_parameters.txt'
    <number_WUs> 5 </number_WUs>
    <number_params_per_WU> 20 </number_params_per_WU>
    Using SSE4.1 path
    Integral 0 time = 818.259082 s
    Running likelihood with 47431 stars
    Likelihood time = 1.152013 s
    <background_integral> 0.000395229940402 </background_integral>
    <stream_integral> 3.068682680690464 107.536654276206780 90.712861191606876 </stream_integral>
    <background_likelihood> -6.591808580671233 </background_likelihood>
    <stream_only_likelihood> -72.128164401557285 -2.892107495485999 -13.431452465675566 </stream_only_likelihood>
    <search_likelihood> -2.774016998251114 </search_likelihood>
whilst the invalid one has this...
    Switching to Parameter File 'astronomy_parameters.txt'
    <number_WUs> 5 </number_WUs>
    <number_params_per_WU> 20 </number_params_per_WU>
    Using SSE4.1 path
    Failed to find header in checkpoint file
    Failed to read state
    Failed to calculate likelihood
Both seemed to produce numerically identical results for the remaining four items, so whatever caused the issue for the first job didn't spill over into the others...
As I don't run Windows, and don't do Separation CPU tasks, I can't really offer any explanation :-( At least it doesn't appear to be the return of the ill-formed tasks!
Cheers - Al. |
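For anyone who wants to repeat this kind of side-by-side comparison on their own results, here is a minimal Python sketch. It assumes the task output has been saved as plain text; the helper names (`parse_job`, `missing_tags`) are made up for illustration and are not part of BOINC or the MilkyWay application. The tag names and failure messages are copied from the excerpts quoted above.

```python
import re

# Shortened excerpts of the two logs quoted above (first "job" only).
GOOD_LOG = """\
Using SSE4.1 path
Integral 0 time = 818.259082 s
Running likelihood with 47431 stars
<background_integral> 0.000395229940402 </background_integral>
<stream_integral> 3.068682680690464 107.536654276206780 90.712861191606876 </stream_integral>
<search_likelihood> -2.774016998251114 </search_likelihood>
"""

BAD_LOG = """\
Using SSE4.1 path
Failed to find header in checkpoint file
Failed to read state
Failed to calculate likelihood
"""

# <tag> numbers </tag> pairs, and lines that start with "Failed to".
TAG_RE = re.compile(r"<(\w+)>\s*(.*?)\s*</\1>", re.S)
FAILURE_RE = re.compile(r"^[ \t]*(Failed to .*?)[ \t]*$", re.M)


def parse_job(text):
    """Return ({tag: [floats]}, [failure lines]) for one job's output."""
    values = {
        tag: [float(x) for x in body.split()]
        for tag, body in TAG_RE.findall(text)
    }
    failures = FAILURE_RE.findall(text)
    return values, failures


def missing_tags(reference, candidate):
    """Tags reported by the reference job but absent from the candidate."""
    return sorted(set(reference) - set(candidate))


good_values, good_failures = parse_job(GOOD_LOG)
bad_values, bad_failures = parse_job(BAD_LOG)

print("missing in invalid result:", missing_tags(good_values, bad_values))
print("failure lines:", bad_failures)
```

Run against the two excerpts, this lists the result tags the invalid task never produced and the three "Failed to" lines that replaced them, which is the difference described in the post.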
Joined: 2 Nov 10 Posts: 25 Credit: 1,894,269,109 RAC: 0 |
5 w/u, the son of 7 w/u. Today, I encountered two that invalidated the tasks. I think the 5 w/u corruption is a sickness (a variant of the 7 w/u disease) that can be present in untold numbers of tasks yet to be run. Yes, it's a pandemic. It has to be corrected. And no, there were no computer errors, except those errors committed by the computer that built the tasks. |
Joined: 16 Mar 10 Posts: 213 Credit: 108,992,769 RAC: 29,470 |
"5 w/u, the son of 7 w/u. Today, I encountered two that invalidated the tasks. I think the 5 w/u corruption is a sickness (a variant of the 7 w/u disease) that can be present in untold numbers of tasks yet to be run. Yes, it's a pandemic. It has to be corrected."
I notice that the two newly invalidated tasks appear to have the same issue I remarked upon above. I also notice that there is a retry in progress for both of them at the time I composed this message... Until enough wingmen have responded and also failed for one reason or another, there's no proof that there's something wrong with the BOINC work unit from which the tasks are built. And, if the reason for failure is to do with the wingman's system, that says nothing about the actual tasks as the limit on errors is set at 2 results so any other results may get tagged as "Completed, can't validate"...
On the "can't validate" point - I notice that you also got some of those on one of your systems, and in each case there were two wingmen that ended up in an Error state; one of the systems had a GPU that can't do double precision (so no surprise there!) and the other was a Linux system using an OpenCL version observed to cause problems in the past. Your results (and those of the other wingman who couldn't validate on those units) would almost certainly have validated were it not for those two non-task-related errors getting in the way -- processing of those tasks was not helped by the long delay caused by the server problems :-(
And I repeat that these tasks are supposed to be 5 w/u, but I suspect you know that, and it does make for a catchy tag-line! That said, declare a work unit pandemic if and when your wingmen start to show identical symptoms - until then, watch and wait...
And if you are genuinely interested in trying to find out why some of your tasks don't handle the first sub-task properly, a thread in the "Number crunching" forum is probably a good idea, rather than pursuing it in a News thread!
:-) Cheers - Al. |
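The "can't validate" reasoning above can be illustrated with a toy model in Python. This is only a sketch of the behaviour as described in the post, not BOINC's actual server logic: the function name, the quorum of 2, and the rule that the work unit stops as soon as the error count reaches the limit are all assumptions made for illustration.

```python
def workunit_state(outcomes, quorum=2, error_limit=2):
    """Toy model: walk through results in the order they are reported.

    outcomes    -- list of "success" or "error" strings, one per returned task
    quorum      -- assumed number of agreeing successes needed to validate
    error_limit -- assumed number of errors at which the work unit is stopped
    """
    successes = errors = 0
    for outcome in outcomes:
        if outcome == "success":
            successes += 1
            if successes >= quorum:
                # Assumes the successful results agree with each other.
                return "validated"
        elif outcome == "error":
            errors += 1
            if errors >= error_limit:
                # Any good result already returned is left stranded.
                return "can't validate" if successes else "error"
    return "waiting for wingmen"


# One good result plus two wingmen whose systems error out:
print(workunit_state(["success", "error", "error"]))
# One good result, one error, then a second good result:
print(workunit_state(["success", "error", "success"]))
```

In this model a perfectly good result ends up as "can't validate" purely because two unrelated hosts failed first, which is the situation described for the double-precision GPU and the OpenCL system.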
Joined: 10 Apr 19 Posts: 408 Credit: 120,203,200 RAC: 0 |
"I am sure you all remember the w/u=7 tasks that caused invalidations. Well, this morning I encountered the son of w/u=7. Its signature is w/u=5 and it causes validation errors. Que paso?"
WU = 5 is normal; there is currently one test run up that has a bundle size of 5. All of Jake's WUs had bundle size 5. The only reason that mine have 4 is that we are trying to fit more streams per stripe than Jake was. |
Joined: 2 Nov 10 Posts: 25 Credit: 1,894,269,109 RAC: 0 |
I have a simple question. Are we done here or has MW cratered? Well, maybe MW is just over-subscribed and doesn't need all the computing power it has available. MW seems hopelessly befuddled. The supply of runnable tasks is sporadic, at best. Validation is a bad joke: the copies of tasks required for validation don't get sent. Tasks error out for no response after 2 minutes in the client. What kind of a project plan could tolerate 8 million unsent N-Body tasks? Why are there 0 Separation tasks, especially on a weekend? Talk about erratic, what about the servers? I could continue this list but it isn't necessary. There are a ton of problems. There is no money to generate fixes. So, are we done here? |
Joined: 12 Nov 21 Posts: 236 Credit: 575,038,236 RAC: 0 |
"Are we done here or has MW cratered?"
I hope not. I think there is good work to be done here. I recall in Prof. Heidi's video that she would like to run a bunch of simultaneous star streams, similar to the Orphan-Chenab stream in the video.
"Well, maybe MW is just over-subscribed and doesn't need all the computing power it has available."
Good question. Been wondering that myself.
"MW seems hopelessly befuddled."
That, I think, goes to leadership and available resources. I don't think RPI management is willing to devote what it takes in manpower, money, and compute resources to kick it up a notch or two. Money is very tight right now.
"The supply of runnable tasks is sporadic, at best."
Yep.
"Validation is a bad joke: the copies of tasks required for validation don't get sent."
Yep, something funky is going on with the servers. For now, would a periodic reboot of the whole enchilada be appropriate? Say once every 2 days? Once a week?
"Tasks error out for no response after 2 minutes in the client."
Yep, although I don't think I have seen that in several weeks. Shouldn't happen at all.
"What kind of a project plan could tolerate 8 million unsent N-Body tasks?"
A week or so ago it was at 17 million. I suspect a configuration error on the server side. Plus there was a hard drive failure that was very difficult to recover from.
"Why are there 0 Separation tasks, especially on a weekend?"
Good question again.
"Talk about erratic, what about the servers?"
Funny you should mention that. I was just looking into Dell servers to see what it would cost to R&R the whole lot of 'em for faster performance. (Assuming server performance is the issue; I don't know at this point, but something is going on.)
"I could continue this list but it isn't necessary."
Maybe you should. Perhaps the active posters here could formulate a list of things we would like to see addressed?
"There are a ton of problems. There is no money to generate fixes. So, are we done here?"
Good post overall. Glad to see it. +1
P.S. Also, could be the 25-year-old BOINC software as well. It might not be up to the large number of volunteers offering compute cycles on ever faster computers, and now with screaming hot GPU processors. |
Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
"Are we done here or has MW cratered?"
"I hope not. I think there is good work to be done here. I recall in Prof. Heidi's video that she would like to run a bunch of simultaneous star streams, similar to the Orphan-Chenab stream in the video."
I sure hope not too!!
"Why are there 0 Separation tasks, especially on a weekend?"
"Good question again."
Simply because people go home for the weekend and the staff that do stay aren't there to 'manage' the Servers.
"P.S. Also, could be the 25-year-old BOINC software as well. It might not be up to the large number of volunteers offering compute cycles on ever faster computers, and now with screaming hot GPU processors."
Universe was just overwhelmed saying "The problem is now in very large number of concurrent HTTP connections. Our server handles about 300 simultaneous connections and it is maximum what it can do".
Back in the days of dialup, Seti did a test of how long it took for each PC to request a connection from the Server, the Server to acknowledge that request and open a port, the PC to start sending data and the Server to receive it, and for the Server to then say goodbye to the PC and close the port. It was in the 5 second range. Obviously today's connections and Servers are significantly faster, but time and number of users are not on the Project's side in all this. Also, as you said above, today's computers can get thru data significantly faster than just 5 years ago, and GPUs are even faster than that!!
I think they are going to have to separate the CPU and GPU tasks into different Servers, or maybe even the NBody and Separation tasks, so that each Server can take some of the load that the current Server is obviously being overwhelmed with. As you also said, though, that requires money or a donation of hardware that is sufficiently up to date for the Project to accept it.
Years ago I got a Server from my son, who got it from his school, who got it from the US State Dept as a surplus thing for the IT class my son was in. They played with it for the whole school year, then the teacher wanted it just gone, so my son loaded it into his car and brought it to me. It was over 2 feet square and tall and VERY VERY heavy and loaded with SCSI drives. After playing with it for over 6 months, including buying new memory and drives for it etc etc, I offered it to Seti when they said they were in desperate need of a Server. I sent them pictures and they said 'thanks but no thanks' and said if I wanted to ship it to them they would figure out something to do with it. I lived on the East Coast and said thanks anyway, so I found a guy at work who wanted to learn how to make the Server software work so he could get promoted. I gave it to him and he loaded it into his trunk and it made the car MUCH lower in the back. He did get the promotion!! |
Joined: 2 Nov 10 Posts: 25 Credit: 1,894,269,109 RAC: 0 |
I don't think overload of the servers is a problem. Just look at the numbers of users active on a given day. It has been hanging around 6,000 for weeks and the number of active contributors is around 15,000. However, I have seen the response time of the servers stretch out substantially. I believe the slowness of server response may lie with the Internet. It could be electronic interference or overload of an intermediate server (Internet speed will always be controlled by the slowest server, and collisions simply stop transmission until there is a channel open).
On the "Timed out - no response" errors: yesterday I encountered 27 of them. They did not error in 2 minutes but rather in 2 hours.
If a reboot can stabilize server operation I would encourage it, probably in the dead of night.
I am encouraged by your attitude and intelligence. Maybe the crater can wait. However, the befuddlement fix cannot. |
Joined: 22 May 11 Posts: 71 Credit: 5,685,114 RAC: 0 |
The problem is with the fact that night is relative. On the other side of the Earth it will be day. |
Joined: 11 Mar 22 Posts: 42 Credit: 21,902,543 RAC: 0 |
"befuddlement fix"
I love that word - hits the nail right on the head! |
Joined: 12 Nov 21 Posts: 236 Credit: 575,038,236 RAC: 0 |
Wow. I had not even considered that. Let's say 10,000 participating computers (a little bit high for now), and they ping every 2 minutes, as Keith does. So 5,000/minute, or 83 pings per second. I have no idea, but it seems like a lot to me. Especially when you dog-pile on type of tasks, quantity to send of each type, ensuring no duplicates, receiving and sorting returned tasks, database updates, yada yada yada, and pretty soon you are at 100% CPU utilization.
I'm in violent agreement with ya on splitting up the server load into CPU Separation, GPU Separation (and that might even be subdivided into AMD, NVIDIA, and Intel types, just because they run so dang fast), and also CPU N-Body. All this depends on how many simultaneous connections are expected if and when Prof. Heidi kicks off the multi-stream project.
@Tom, how many simultaneous connections are you seeing these days? |
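The back-of-the-envelope numbers above can be checked with a few lines of Python. The 10,000 hosts and 2-minute interval are the guesses from this post; the 5-second connection time (a dial-up-era SETI figure) and the 300-connection limit (reported by Universe, not MilkyWay) come from the earlier post in this thread. None of them are measured MilkyWay values. The concurrency estimate uses Little's law (average connections in progress = arrival rate × time per connection), and the function names are just illustrative.

```python
def request_rate(hosts, interval_s):
    """Average scheduler requests per second if every host pings once per interval."""
    return hosts / interval_s


def concurrent_connections(rate_per_s, seconds_per_request):
    """Little's law: average number of connections open at the same time."""
    return rate_per_s * seconds_per_request


rate = request_rate(hosts=10_000, interval_s=120)        # guesses from the post
open_at_5s = concurrent_connections(rate, 5.0)           # dial-up era timing
max_seconds_for_300 = 300 / rate                         # to stay under 300 open

print(f"{rate:.1f} requests per second")
print(f"{open_at_5s:.1f} connections open at 5 s each")
print(f"{max_seconds_for_300:.1f} s per request keeps it at 300 open")
```

Under these assumptions the 83 per second figure holds, and each request would need to finish in a few seconds to stay under a 300-connection ceiling, so how long the server holds each connection matters as much as how many hosts there are.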
Joined: 12 Nov 21 Posts: 236 Credit: 575,038,236 RAC: 0 |
"Years ago I got a Server from my son, who got it from his school, who got it from the US State Dept as a surplus thing for the IT class my son was in. They played with it for the whole school year, then the teacher wanted it just gone, so my son loaded it into his car and brought it to me. It was over 2 feet square and tall and VERY VERY heavy and loaded with SCSI drives. After playing with it for over 6 months, including buying new memory and drives for it etc etc, I offered it to Seti when they said they were in desperate need of a Server. I sent them pictures and they said 'thanks but no thanks' and said if I wanted to ship it to them they would figure out something to do with it. I lived on the East Coast and said thanks anyway, so I found a guy at work who wanted to learn how to make the Server software work so he could get promoted. I gave it to him and he loaded it into his trunk and it made the car MUCH lower in the back. He did get the promotion!!"
Very cool. I tried to give away my old IMSAI 8080. No joy. But it was a beast in its day. |
Joined: 12 Nov 21 Posts: 236 Credit: 575,038,236 RAC: 0 |
"The problem is with the fact that night is relative."
Yeah. Good point. Perhaps reboot at the lowest usage point, whatever that is...? |
Joined: 12 Nov 21 Posts: 236 Credit: 575,038,236 RAC: 0 |
"I don't think overload of the servers is a problem. Just look at the numbers of users active on a given day. It has been hanging around 6,000 for weeks and the number of active contributors is around 15,000."
Truth be told, I am soooo captured by this project. I am slowly, very slowly, gathering info on the project characteristics to see if it can be tweaked in some way to support a more robust throughput at the client level. If tweaking won't get us there, then perhaps a campaign to rebuild the whole MW@H infrastructure from the ground up might be in order. That's very extreme. Prolly gonna take a sugar daddy or 2 to fund it. With or without BOINC. (I was in the Find-a-drug cancer project lo, these many years ago, and it did not use BOINC. I screened 525,000,000 molecules, and found 17 that had anti-cancer properties.) Obviously strong support from RPI will be required too. This is all Blue Sky Smack Talk right now, but could bear fruit down the line somewhere.... |
Joined: 11 Mar 22 Posts: 42 Credit: 21,902,543 RAC: 0 |
"If tweaking won't get us there, then perhaps a campaign to rebuild the whole MW@H infrastructure from the ground up might be in order. That's very extreme. Prolly gonna take a sugar daddy or 2 to fund it."
Crowd funding could help with raising money for more capable hardware. However, one enthusiastic manager like Tom Donlon is just not enough to handle the rebuild of the MW software. We would need a team of 10 Donlons to do that. |
Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
"The problem is with the fact that night is relative."
The problem is they go home at night and on weekends, their time, so unless they can get someone to come in and shut it down and then restart it and make sure everything is running, that may be a no go for them. Now, they could do it at a fixed time every week, for example, to see if things run more smoothly doing that. Remember, Seti used to shut down every Tuesday for 'maintenance'. |
Joined: 8 May 09 Posts: 3339 Credit: 524,010,781 RAC: 0 |
"I don't think overload of the servers is a problem. Just look at the numbers of users active on a given day. It has been hanging around 6,000 for weeks and the number of active contributors is around 15,000."
I think it would be a bad thing if MilkyWay went private and/or non Boinc, we need projects like MW to give those who still have eyes on the sky but can't or don't have the time to do anything about figuring out what's up there and how our little blue World fits into it. It also helps prove that Science is not just for those in white coats sitting in a lab or with their eyes glued to a screen someplace, it's also for the little guy sitting at home who just wants to feel they can do something about it. Kinda like all the Covid Boinc projects that popped up, people felt like they were a part of the solution not just going along with the flow wherever it leads. |
Joined: 14 Oct 19 Posts: 6 Credit: 3,455,991 RAC: 242 |
MilkyWay is violating the BOINC Computing Preferences, running when my Linux Ubuntu 22.04 LTS system is in use. |
Joined: 22 May 11 Posts: 71 Credit: 5,685,114 RAC: 0 |
Boinc will do so if you set it to run always instead of run based on preferences. |
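For those who prefer the command line to the BOINC Manager menu, the run mode can also be switched with `boinccmd --set_run_mode`, where `auto` corresponds to "run based on preferences". The Python wrapper below is only a sketch: the function names are made up for illustration, and it assumes `boinccmd` is on the PATH and allowed to talk to the local client (on some installs it has to be run from the BOINC data directory or given `--passwd`).

```python
import subprocess

VALID_MODES = ("always", "auto", "never")  # "auto" = run based on preferences


def run_mode_command(mode, duration_s=0):
    """Build the boinccmd argument list for switching the run mode.

    duration_s -- optional number of seconds the mode should last;
                  0 (the default) makes the change permanent.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    cmd = ["boinccmd", "--set_run_mode", mode]
    if duration_s:
        cmd.append(str(duration_s))
    return cmd


def apply_run_mode(mode, duration_s=0):
    """Actually run boinccmd; raises CalledProcessError if the client refuses."""
    subprocess.run(run_mode_command(mode, duration_s), check=True)


# Show the command without executing it:
print(" ".join(run_mode_command("auto")))
```

Calling `apply_run_mode("auto")` puts the client back under the computing preferences, so the "suspend when computer is in use" setting is honoured again.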
©2024 Astroinformatics Group