Note to self… coincidences in IT DO happen!
OK, so on Friday I was working from home while recuperating after some surgery (don’t ask). I’m currently working on a large migration project with a really tight, high-priority timescale, which is why I was working from home: neither I nor the company I work for can really afford for me to be away from this project for any length of time. I was working on a large IBM RS/6000 AIX wide node where I needed to create an NFS share for their new Red Hat-based platform. This required a minor change to the genfilt / mkfilt rules on AIX to allow the new systems to access the NFS shares. I made the one-line change and reloaded the firewall on the system. Unfortunately this made NIS/YP fault and stop responding. That wouldn’t be such a big deal, except that this node is also a NIS server, which meant that users authenticating from a frontend running on a thin node were unable to log in, and that started to cause issues quickly. Fortunately existing users weren’t affected, but newly connecting users weren’t getting on.
As soon as I’d reloaded the firewall I could see that NIS had (inexplicably) failed, so I backed out the change. I had to get NIS back online, and reloading the YP services wasn’t working. With the change backed out, I reloaded the firewall again, but this time mkfilt just wasn’t having it. The syntax was fine, but the firewall was now blocking access to all services. Remember, I’m working from home, via an SSH session to a host at work with rlogin access to the wide node. As soon as the firewall started blocking traffic my remote session died and I was unable to get back in. FUCK!
I get on the phone straight away to the DC and ask a colleague of mine, Brian, to re-run the firewall script from the control workstation, which has a direct, non-IP connection to the wide node. About 10 minutes later I get a call saying he’s been able to restart the firewall OK, and I can access the server from my connection again. Phew. NIS is still down though, and still refuses to start up cleanly. A reboot is in order. By this time, I’m pretty much ready to head into the DC so I can be hands-on with the kit when needed. Brian gets in touch with the client and co-ordinates a graceful shutdown of the databases before we initiate a standard reboot.
By the time I arrive at the DC (15 minutes away by car, fortunately), we’ve managed to arrange “unscheduled maintenance” time, and we bounce both the nodes. Everything comes back up perfectly, and users can log back in just fine. We notify the client, and they can see everything’s OK: the databases have come up and everything’s back the way it was.
I get into finding out what caused NIS and the second firewall reload to spanner completely when we get another call from the company saying that LPR print queue jobs are not being passed from the thin node to a third server, which is running Caldera OpenLinux (yeah, I know!). This Caldera box runs Tarantella, which provides client-based printing. Essentially, users print from a terminal on the thin node, which is mapped to a remote print queue on the Caldera server, and the Tarantella server then maps the user’s printer to their print queue on the Caldera server, allowing (in a very roundabout way) client-based printing from a terminal. For turn-of-the-century stuff this was quite advanced, since there was otherwise no way to do this dynamically, from a web-based (HTTPS) client, and without setting up static routed print queues on the node.
So that’s the background. The printing issue had been an intermittent problem since the platform was introduced, and it had normally been resolved by a simple reboot of the Caldera server. When we heard about it this time, we decided that since the nodes had been down, this had likely caused a bottleneck between the servers, and that the Caldera box needed a reboot to clear the bottleneck and get the print queues moving again. We bounced the box and the queues were still being held on the thin node. FRAK!
I knew beyond a doubt that the issue wasn’t software firewall related, since (a) my minor change wouldn’t have affected port 515 communications and (b) the firewall was running OK. My boss, John, had become involved around the time we rebooted both the nodes, as he was interested to know what was going on. After being brought up to speed he was convinced that this was a firewall-related issue, since the initial cause was firewall related and I’d asked our network manager to add new rules to allow NFS between the new and old platforms. I knew it was highly unlikely that the problem was a firewall one, since the changes had been backed out and the system was in its normal, default configuration, but John felt that the timing was just too close for it to be anything other than a firewall issue.
It took us a while to look through the firewall rules in place, check whether any hits were being matched on the Ciscos (they weren’t), telnet to ports and so on, all of which was fruitless. It was obvious in my mind that something on the Caldera box was stopping the LPD daemon from responding properly. After looking through the Tarantella logs I checked /var/log/messages and saw that the LPD daemon had faulted at start-up with the error “not enough disk space”. That old chestnut.
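For what it’s worth, the “telnetting to ports” step is easy to script. Here is a minimal Python sketch of that kind of check, assuming a plain TCP connect is enough to tell “blocked or nothing listening” from “something answers”. The function name and the hostname in the usage comment are made up for illustration; port 515 is the LPD port mentioned above.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures
        return False


# Hypothetical usage: "caldera-print" is a placeholder hostname, not a real one.
# print(port_is_open("caldera-print", 515))
```

A successful connect only proves that something is listening and the path is open. It says nothing about whether the daemon behind the port can actually accept jobs, so the logs on the box itself are still the place to look.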
After a little more digging, it turned out that the / partition, which had only 2GB of space, had slowly been filled up by Apache access and error log files since the early 2000s until the disk was full. Monitoring hadn’t been set up to check disk space usage, which beggared belief, but there it is. The Apache logs had filled the last of the available disk space at pretty much the exact same time as the AIX system had gone down. All that time wasted checking firewall rules, and the printing problem came down to a frakking simple thing: disk space.
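The missing disk space check doesn’t need to be anything clever. As a rough sketch (not what was actually deployed), something like the following Python could run from cron and shout when a filesystem gets too full. The function names and the 90% threshold are my own assumptions for illustration; `shutil.disk_usage` is part of the standard library.

```python
import shutil


def percent_used(total, used):
    """Return used space as a percentage of total capacity."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def check_disk(path="/", threshold=90.0):
    """Return (over_threshold, percent) for the filesystem holding path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    pct = percent_used(usage.total, usage.used)
    return pct >= threshold, pct


if __name__ == "__main__":
    # The 90% threshold is an assumed value; tune it per filesystem.
    over, pct = check_disk("/", 90.0)
    status = "WARNING" if over else "OK"
    print(f"{status}: / is {pct:.1f}% full")
```

On a 2GB root partition that is slowly filling with web server logs, even a crude check like this would have flagged the problem long before LPD refused to start.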
So the moral of this story is that blind coincidence DOES happen in this profession, and it’s something that I’ll definitely remember for the rest of my career!

