Friday, January 17, 2014

Cluster B(l)uster

Dear Diary,

Well, there's no easy way to say it...but I suppose you should know:  I broke the cluster.

For several weeks, those who use the computing cluster have complained that it's been abnormally slow.  In fact, I even received an email from another graduate student concerned that I might be using the resource improperly.  I denied it at the time, but it appears I had been complaining about myself for a few months.

In retrospect, I should have known that something was wrong because the data was sometimes coming out strangely.

You see, you submit jobs through a temporary directory, and from there everything gets farmed out to the cluster machines.  If your job has to copy a lot of files back from that temporary directory, it gums up the system.  And I'm guilty here.  To fix it, my little submission script now has only ">out" instead of ">(home directory)/out", so the output stays in the temporary directory instead of being written back home while the job runs.  Further, I only needed two files out of the possibly thousands that get output.  So, I just added 'make clean' to the submission script, and that gets rid of everything else before it comes back to me.
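
For concreteness, here's a minimal sketch of what the fixed submission script looks like, assuming a PBS-style scheduler; the job name, program, and file names are all made up, but the pattern is the one above: write output locally, save the couple of files that matter, and 'make clean' before anything gets copied back.

    #!/bin/bash
    #PBS -N myjob                  # hypothetical job name
    #PBS -l nodes=1:ppn=8          # one computer, eight threads (see below)

    # The job runs out of a temporary directory on the cluster, so a
    # plain ">out" keeps the output there instead of streaming it back
    # to the home directory while the job runs.
    ./my_program > out             # "my_program" is a stand-in name

    # Save only the two files actually needed...
    cp out results.dat "$HOME/"    # hypothetical file names

    # ...then delete the thousands of leftover files so they don't get
    # copied back when the job finishes.
    make clean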

Charge 2 is requesting the correct number of computers and threads.  I had requested one computer and one thread, but a command for OpenMP (which parallelizes the code to make it go faster) was set to use 8 threads.  So my program was unhealthily bumping up against other people's jobs, corrupting everyone's data and breaking file output.  This really made things messy.
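
Here's the shape of that mistake and the fix, as a sketch (again with a made-up program name); the whole point is that the OpenMP thread count has to agree with what the scheduler was asked for:

    #!/bin/bash
    # Before: the script requested a single core...
    #     #PBS -l nodes=1:ppn=1
    # ...while OpenMP spun up 8 threads, so the job spilled onto cores
    # that belonged to other people's jobs.

    # After: request as many processors as OpenMP will actually use.
    #PBS -l nodes=1:ppn=8

    export OMP_NUM_THREADS=8       # match the threads to the request
    ./my_program > out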

The key here is not to be misled by the output files.  They say, simply, "nodes".  I had requested 1 node (aka one computer) with ppn=8 (for 8 threads), but the output reported 8 nodes...So I should trust that I filled out the request fields correctly instead of second-guessing them from the output file...Summarily: nodes=1:ppn=8 means one computer, eight threads.
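
For reference, the same request can also be made on the command line when submitting (the script name here is hypothetical):

    qsub -l nodes=1:ppn=8 submit_job.sh    # 1 node (computer), 8 threads on it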

Now, none of this is particularly egregious...and I would have gotten away with it (if it weren't for you meddling kids!), but I was running so much data that little mistakes made a big difference.

I'm going to go barricade myself in over the weekend and lie low until this whole thing blows over...oh, and I'll run that data that I owe Li.  These fixes actually come at a good time, because the weekend is the best time to run jobs (fewer people use the cluster).
