Evidently, the improved DB connection did not help as expected: one of the cluster nodes suffered a 2-hour outage.
This is a growth problem: not only does server load grow, but so does the effective coupling between subsystems through their increasingly loaded interfaces. Events that were once isolated begin to propagate between subsystems.
An investigation is under way. In addition, an alternative solution will be implemented today.