Sunday 3 February 2013

Weblin is Down

A first analysis indicates connection problems between the cluster hosts. Even ssh between hosts and to hosts from outside is slow. Database server has lots of connections. Web server has lots of connections. Maybe routing issues on the server hosting side?

Weblin has effectively been down for more than 1 h.

Update 15:01: Rebooting all servers, just in case. Sometimes helps with routing problems. Did not help.

Update 15:21: 20 minutes later still WTF

Update 15:30: All my putty/ssh windows have just disconnected and the servers seem to be faster now. Looks like the hoster (or someone else) has reset some networking equipment.

Update 15:36: Login fails. The DB server does not accept connections from the web server, because of "too many connection errors". Interesting feature. Probably a follow-up error. Solved by rebooting the DB once more.

Update 21:55: Another outage from about 17:20 to 18:30. Systems recovered on their own.

Saturday 12 January 2013

Messaging Server Offline

The weblin client is unable to connect. The messaging server is not reachable, seems to be offline. Working on it.

Update 20:05 h: Reboot - back online.

Apparently, the thing was offline for about 1 hour, while the admin was out for lunch. No explanation yet.

Saturday 11 August 2012

Portal and Client Login Offline

On Friday at 8 pm (20:00 h), the portal, including the client login script, went offline. Operation has been restored.

Clients which were online continued to run. But new client logins were not possible. In other words: you could not start weblin.

Sorry for not noticing and for not reading email in the last 12 hours. Thanks to the people who notified me by email and on social networks.

Analysis:

There were too many Apache processes running. I increased the maximum number of Apache processes, because there is plenty of memory available.

But the main question is why some Apache processes do not terminate. The graph shows that the behavior started about 2 months ago, in June. Until then, only a few processes were running. There were no (known) configuration changes in June. I will observe the behavior and try to check what these processes were doing last. Unfortunately, once everything has stopped, it is no longer possible to check what the processes were serving beforehand.
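
One way to catch what the processes are doing before the next stop would be to record the worker states at regular intervals. The following is only a rough sketch, not what is running on the Weblin servers. It assumes that Apache's mod_status is enabled and that its machine-readable output is reachable at http://localhost/server-status?auto (the URL is an assumption). The function names count_worker_states and fetch_status are made up for this example.

```python
from collections import Counter
from urllib.request import urlopen

# Meaning of the scoreboard characters as documented for Apache mod_status.
STATES = {
    "_": "waiting",
    "S": "starting",
    "R": "reading request",
    "W": "sending reply",
    "K": "keepalive",
    "D": "dns lookup",
    "C": "closing",
    "L": "logging",
    "G": "graceful finish",
    "I": "idle cleanup",
    ".": "open slot",
}


def count_worker_states(status_text):
    """Count worker states from the 'Scoreboard:' line of the ?auto output."""
    for line in status_text.splitlines():
        if line.startswith("Scoreboard:"):
            board = line.split(":", 1)[1].strip()
            return Counter(STATES.get(ch, "unknown") for ch in board)
    # No scoreboard line found (e.g. mod_status not enabled).
    return Counter()


def fetch_status(url="http://localhost/server-status?auto"):
    """Fetch the status page. The URL is an assumption, adjust as needed."""
    with urlopen(url, timeout=5) as response:
        return response.read().decode("ascii", errors="replace")


# Example use (hypothetical), e.g. once a minute from cron:
#   print(dict(count_worker_states(fetch_status())))
```

Appending these counts to a log file with a timestamp would show whether the stuck processes pile up in one particular state, for example "sending reply" or "keepalive", in the hours before the limit is reached.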

Monday 21 May 2012

Unexpected Outage

The location mapping server seems to be down. Weblin and the Weblin servers work, but it is not possible to enter a page. Trying to fix the problem by moving the location mapping server to a different host.

UPDATE: Looks like the service has been restored. Reconnecting all clients to force them to re-enter their rooms.

Saturday 12 November 2011

Migration Downtime

Scheduled downtime Sunday morning. http://blog.weblin.com/2011/11/moving-to-bigger-server.html

Sunday, Nov. 13. 2011, 06:00 h - 10:00 h CET if everything works (as it always does)

UPDATE: Downtime extended to 13:00 h CET (no problem, just late)

Tuesday 10 March 2009

Scheduled downtime

Server migration tomorrow morning, March 11th, 2009, from 7:00 am UTC until 12:00 am UTC. All services will be unreachable.

Wednesday 25 February 2009

Scheduled downtime

Server migration tomorrow morning, February 26th, 2009, from 7:00 am UTC until 10:00 am UTC. All services will be unreachable.