Sunday 3 February 2013

Weblin is Down

A first analysis indicates connection problems between the cluster hosts. Even ssh between the hosts, and from outside to the hosts, is slow. The database server has lots of connections. The web server has lots of connections. Maybe a routing issue on the hosting provider's side?

Weblin has effectively been down for more than an hour.

Update 15:01: Rebooting all servers, just in case. This sometimes helps with routing problems. It did not help.

Update 15:21: 20 minutes later and still WTF.

Update 15:30: All my putty/ssh windows have just disconnected and the servers seem to be faster now. Looks like the hoster (or someone else) has reset a piece of networking equipment.

Update 15:36: Login fails. The DB server does not accept connections from the web server because of "too many connection errors". Interesting feature. Probably a follow-up error of the network problems. Solved by rebooting the DB server once more.
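Why would a reboot fix this? A plausible explanation, assuming the DB server behaves like MySQL with its max_connect_errors setting: the server counts failed connection attempts per client host and refuses that host once the count reaches a threshold, even after the network is fine again. The counters live in memory, so a restart clears them. The sketch below is a hypothetical model of that mechanism (the class name, method names and the threshold of 3 are made up for illustration), not the actual server code.

```python
from collections import defaultdict


class ConnectErrorGuard:
    """Hypothetical model of a DB server that blocks a client host after
    too many failed connection attempts (assumed to resemble MySQL's
    max_connect_errors behaviour)."""

    def __init__(self, max_errors: int = 10):
        self.max_errors = max_errors
        # In-memory per-host count of failed attempts; lost on restart.
        self.errors = defaultdict(int)

    def is_blocked(self, host: str) -> bool:
        return self.errors[host] >= self.max_errors

    def record_attempt(self, host: str, ok: bool) -> bool:
        """Record one connection attempt; return True if it is accepted."""
        if self.is_blocked(host):
            return False  # blocked hosts are refused, even if the network works again
        if ok:
            self.errors[host] = 0  # a successful connect resets the counter
            return True
        self.errors[host] += 1
        return False

    def restart(self) -> None:
        """Simulate a server reboot: all in-memory counters are gone."""
        self.errors.clear()


# Network trouble: the web server's attempts keep failing.
guard = ConnectErrorGuard(max_errors=3)
for _ in range(3):
    guard.record_attempt("web", ok=False)

print(guard.record_attempt("web", ok=True))  # network fixed, but still refused
guard.restart()
print(guard.record_attempt("web", ok=True))  # accepted after the reboot
```

With MySQL specifically, flushing the host cache (e.g. `mysqladmin flush-hosts`) should clear the block without a full reboot.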

Update 21:55: Another outage from about 17:20 to 18:30. Systems recovered on their own.