Hello,
I feel obliged to explain what happened with polarhome during the last week.
What users experienced was that polarhome.com was not reachable and the domain did not resolve, and when it did come up for a few hours it crashed again for no apparent reason.
Usually in autumn the gate, which is placed in the attic, suffers from condensation, and I usually solve this with a light bulb that raises the temperature in the box so that the hard disk survives.
Last Sunday I was confident that this was the cause of the gate's crash.
The kernel crashed and left a dump.
All the input I had was 30-40 seemingly random lines of kernel dump that showed some network-related and interrupt-related functions.
My first feeling was that the PCI slots had damaged the network cards. But when I replaced the network cards with new ones and this did not help, I thought it might be good to try the disk in a whole new box.
I took ubuntu, because this host has the fewest users, but this did not help either.
One night in the middle of the week I found a suspicious function call in the dump: nf_conntrack. It looks very similar to a familiar one that is very important for polarhome: ip_conntrack, which is responsible for all NAT functionality.
It was a revelation to see that kernel development had moved from ip_conntrack to nf_conntrack and I had not noticed. The functionality still works, and the kernel modules still load even if I load them under names like ip_*something.
After some testing I found out that this might cause the kernel dump, because the iptables config parameters do not get through to netfilter.
The kernel simply runs out of the allocated conntrack memory (because polarhome has huge traffic and the kernel needs to be tuned in order to work with the full traffic load).
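For anyone who wants to watch how full the conntrack table is on their own gateway, here is a minimal sketch. It assumes a kernel that uses nf_conntrack and exposes the usual procfs entries nf_conntrack_count and nf_conntrack_max; the function names and the 80% warning threshold are only my illustration, not what polarhome actually runs.

```python
from pathlib import Path

# Usual procfs locations on kernels that use nf_conntrack.
# Older ip_conntrack kernels expose these values under different paths.
COUNT_PATH = "/proc/sys/net/netfilter/nf_conntrack_count"
MAX_PATH = "/proc/sys/net/netfilter/nf_conntrack_max"


def conntrack_usage(count_path=COUNT_PATH, max_path=MAX_PATH):
    """Return (tracked connections, table limit, fill ratio)."""
    count = int(Path(count_path).read_text().strip())
    limit = int(Path(max_path).read_text().strip())
    return count, limit, count / limit


def is_near_limit(count, limit, threshold=0.8):
    """True when the table is at least `threshold` full (0.8 is an arbitrary choice)."""
    return count >= threshold * limit


if __name__ == "__main__":
    try:
        count, limit, ratio = conntrack_usage()
    except (OSError, ValueError):
        print("nf_conntrack values are not available on this machine")
    else:
        print(f"{count}/{limit} tracked connections ({ratio:.0%} full)")
        if is_near_limit(count, limit):
            print("warning: conntrack table is close to its limit")
```

Run from cron every few minutes, something like this would at least show whether the table is filling up before the box dies.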
I tried to pass the parameters to netfilter, and this is the reason why the box survived more than 8 hours during the day.
It seems the raised limit was still not enough, because it died again that afternoon.
In fact, to fill up this conntrack table it is enough to have a few thousand servers accessing services behind the gateway server at the same time.
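To make that arithmetic concrete, here is a rough sketch. All numbers in it are hypothetical (the client count, the connections per client and the table limit are not polarhome's real values), and it ignores that entries also linger for a while after a connection closes, so the real table fills even faster.

```python
def peak_conntrack_entries(clients, connections_per_client):
    """Rough number of entries needed if every client is connected at once."""
    return clients * connections_per_client


def fits_in_table(clients, connections_per_client, table_limit):
    """True if the simultaneous connections fit under the conntrack limit."""
    return peak_conntrack_entries(clients, connections_per_client) <= table_limit


# Hypothetical example values, chosen only for illustration.
clients = 4000
connections_per_client = 20
table_limit = 65536

needed = peak_conntrack_entries(clients, connections_per_client)
print(needed)  # 80000
print(fits_in_table(clients, connections_per_client, table_limit))  # False
```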
I have reported this bug to Fedora (more precisely, I have confirmed an existing bug that is marked URGENT but not really prioritized):
https://bugzilla.redhat.com/show_bug.cgi?id=259501
The solution was to downgrade the kernel, because this is a typical kernel bug that shows up only on very busy servers.
polarhome is a rather specific environment, and there have been other issues that need such an environment to reproduce, like a few years ago when usernames were chopped when the UID was greater than 64k, or the issue with quotas getting messed up above some high number of users.
Regardless, I am sorry that it took so long.
The main problem was in my approach. I assumed that Linux was OK and the problem must be on my side.
Another reason is that I still need to work 8 hours at my job, drive my daughter to piano, ballet and other dance lessons and some language lessons, and spend some time with the family, and I cannot reach the box while it is down either.
But I can assure you that I gave my best with all my knowledge. Unfortunately the performance was not that impressive this time.
Also, I would like to thank polarhome users for their positive approach, patience and trust, and especially the users who offered hardware and administration help in such a tough situation.
polarhome will survive as long as polarhome has such users.