Polarhome Community

by **zoli** » Sun Nov 11, 2007 9:11 am

Hello,

I feel obliged to explain what happened to polarhome during the last week.

What users experienced was that polarhome.com was not reachable, the domain did not resolve, and when it was alive for a few hours it crashed for no apparent reason.

Well, usually in autumn the gate, which is placed in the attic, suffers from condensation, and I usually solve this with a light bulb that raises the temperature in the box so that the hard disk survives.

Last Sunday I was confident that this was the cause of the gate's crash.
The kernel crashed with a dump.

All the input I had was 30-40 lines of kernel dump, produced at random, that showed some network-related and interrupt-related functions.

My first feeling was that the PCI slots had damaged the network cards. But when I replaced the network cards with new ones and this did not help, I thought it might be good to try the disk in a completely new box.

I took ubuntu, because this host has the fewest users, but this did not help either.

One night in the middle of the week I found a suspicious function call in the dump: nf_conntrack. It looks very similar to a familiar one that is very important for polarhome: ip_conntrack, which is responsible for all NAT functionality.

It was a revelation to see that kernel development had moved from ip_conntrack to nf_conntrack and I had not noticed. The functionality still works, and the kernel modules still work even if I load them as ip_*something.
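To check which conntrack modules a box actually has loaded, something like the following sketch could help. It assumes the usual /proc/modules format (module name as the first field on each line); the function name conntrack_modules is made up for illustration.

```python
from pathlib import Path


def conntrack_modules(proc_modules_text):
    """Return the names of loaded modules whose name contains 'conntrack'.

    Expects the text of /proc/modules, where the first whitespace-separated
    field on each line is the module name.
    """
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields and "conntrack" in fields[0]:
            names.append(fields[0])
    return names


if __name__ == "__main__":
    modules_file = Path("/proc/modules")
    # Only meaningful on a Linux box; skip quietly elsewhere.
    if modules_file.exists():
        print(conntrack_modules(modules_file.read_text()))
```

If the list shows nf_conntrack names where ip_conntrack was expected, that is a hint that old tuning settings may no longer reach the module.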

After some testing I found out that this might cause the kernel dump, because netfilter does not receive the iptables config parameters.

The kernel simply runs out of the allocated conntrack memory (because polarhome has huge traffic and the kernel needs to be tuned in order to work under full traffic load).

I tried to pass the parameters to netfilter, and this is the reason why the box survived more than 8 hours during the day.

It seems the raised limit is still not enough, because it died again that afternoon.

In fact, to fill up this conntrack table it is enough to have a few thousand servers accessing services behind the gateway server at the same time.
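A rough way to see how close the table is to its limit is to compare the current entry count against the maximum. The sketch below assumes the procfs files that nf_conntrack kernels normally expose (nf_conntrack_count and nf_conntrack_max under /proc/sys/net/netfilter); the helper names and the 80% warning threshold are illustrative choices, not anything taken from the polarhome setup.

```python
from pathlib import Path

# Usual locations on kernels that use nf_conntrack (assumption: older
# ip_conntrack kernels keep similar files under /proc/sys/net/ipv4/netfilter).
COUNT_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_count")
MAX_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def conntrack_usage(count, maximum):
    """Return the fraction of the conntrack table that is in use."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return count / maximum


def is_near_limit(count, maximum, threshold=0.8):
    """True when usage reaches the (hypothetical) warning threshold."""
    return conntrack_usage(count, maximum) >= threshold


def read_int(path):
    """Read a single integer from a procfs file."""
    return int(path.read_text().strip())


if __name__ == "__main__":
    # Only meaningful on a Linux box with the conntrack module loaded.
    if COUNT_PATH.exists() and MAX_PATH.exists():
        count, maximum = read_int(COUNT_PATH), read_int(MAX_PATH)
        print(f"{count}/{maximum} entries, "
              f"{conntrack_usage(count, maximum):.0%} used")
        if is_near_limit(count, maximum):
            print("conntrack table is close to its limit")
```

Run from cron on the gateway, a check like this would at least warn before the table fills up completely.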

I have reported this bug (more precisely, I have confirmed an existing bug that is marked URGENT but not really prioritized) to Fedora:
https://bugzilla.redhat.com/show_bug.cgi?id=259501

The solution was to downgrade the kernel, because this is a typical kernel bug that pops up only on very busy servers.

polarhome is a rather specific environment, and there have been other issues that need such an environment to reproduce, like a few years ago when usernames were chopped when the UID was greater than 64k, or the issue with quotas getting messed up above some high number of users, etc.

Regardless, I am sorry that it took such a long time.

The main problem was in the approach. I assumed that Linux was OK and the problem must be on my side.

Another reason is that I still need to work 8 hours at my job, drive my daughter to piano, ballet and other dance lessons and some language lessons, and also spend some time with the family, etc., and I cannot reach the box when it is down.

But I can assure you that I gave my best with all my knowledge. Unfortunately, the performance was not that impressive this time.

I would also like to thank polarhome users for their positive approach, patience and trust, especially the users who offered hardware and administration help in such a tough situation.

polarhome will survive as long as polarhome has such users.

by **sjaz** » Sun Nov 11, 2007 9:45 am

Zoli,

I have to take my hat off to you.

Not only do you singlehandedly maintain the servers, you have a family, a young daughter, AND a full time job.

The downtime experienced is, in my opinion, a far lower priority than your family life, and I applaud you for your actions.

Enjoy your weekend my friend.

by **Matej** » Sun Nov 11, 2007 1:21 pm

I have to agree. It's remarkable that you find the time for such a project and keep it going for so long. All users should be thankful for that.

by **miker_alpha** » Sun Nov 11, 2007 7:06 pm

Well done Zoli.
Just as a comparison: A large hosting co. on which my son has a site was down for *6 days* last week with no word of explanation (Google for Navisite)

Polarhome is a remarkable effort, and some remarkable people gather round it - Not least the Z-man!!

All the best,
MikeR

by **zoli** » Sun Nov 11, 2007 9:55 pm

thank you very much.

by **afonic** » Mon Nov 12, 2007 9:21 am

Zoli you're seriously doing an awesome job, sometimes I imagine myself even running one server in my room and I give up when I think of problems I may have!

This was an educational story too.

However, I would suggest that if you need something more stable for the gate, just use CentOS. I had some issues on a server where I had Fedora running (just one example: if I used RAID the system would freeze after a while) and I had to deal with the limited support cycle, which forced me to ask for updates very often (FC3 -> FC5 etc). Moving to CentOS 4.3 made all these problems disappear, and that server now has 210 days of uptime (and it is a server with loads of traffic).

CentOS is a rebuild of Redhat Enterprise Linux based on the source RPMs that Redhat releases, and it has a long support cycle. (4.3 will get updates up to 2012 and the current release 5 probably up to 2014)

http://www.centos.org/

by **sjaz** » Mon Nov 12, 2007 9:26 am

I'd second that, I've had problems with fedora on a server, not quite this extensive though.

I see fedora as a debian 'unstable' equivalent on servers.

Or even if it needs to be light, we could consider netbsd, or openbsd.

by **afonic** » Mon Nov 12, 2007 9:33 am

I suggested CentOS since I get a feeling that Z is the "Redhat kind of guy" if you get what I mean.

by **zoli** » Mon Nov 12, 2007 3:27 pm

afonic wrote:Z is the "Redhat kind of guy"

It is true - I do like rpms and yum; otherwise I do not care about the distribution as long as it works.

afonic's suggestion to run Ubuntu on my laptop was a perfect match. Both SuSE and Fedora failed with some hardware issue, while Ubuntu is still working nicely and stably.

I will definitely give CentOS a try, but first of all I would like to check the list of available RPMs. It is very convenient in Fedora that, except for MailScanner, I have all packages managed and updated by Fedora on the gate, which helps a lot in administration.

This is one of the reasons why the gate is not running FreeBSD. FreeBSD is fast, much faster than Linux, and very stable under high load (even under DoS attacks). I like it very much, but the update/upgrade procedure is a real pain (lot of work, kernel compile, uncertain issues, aaaah)... and the gate should be up all the time - you may call it the single point of failure of polarhome.

So, welcome CentOS among the nominees

by **sjaz** » Mon Nov 12, 2007 5:13 pm

Hehe, I understand!!

Yes, I run ubuntu too

by **afonic** » Mon Nov 12, 2007 6:27 pm

Well when it comes to distros I am your man.

It would not be extreme to say I have tried almost all of them. If you check the top 50 in Distrowatch, I've tested them all except maybe 4-5 (even the BSDs) and some more from below the 50 line as well. And I have used Fedora, CentOS, Debian and RHEL on a dedicated server. I have tried so many over the past years that I decided to try and make a buck by reviewing the new ones.

http://www.dvd-guides.com/content/category/4/107/110/

Right now I manage two PCs with CentOS 4.3, one dual Opteron server at Liquidweb and another one at Hetzner.de, and I couldn't be happier with their stability and how easy they are to keep up to date.

And especially the second server is just an Athlon64 with 2GB RAM and it runs LOADS of things besides the standard LAMP. (shoutcast, bncs, irc server, 2 game servers, eggdrop, teamspeak, ventrilo, bittorrent)

by **rus** » Sun Nov 18, 2007 12:15 pm

Thank you, Zoltan.

I thought polarhome had died like lots of other free shells. However, I knew how ebullient Zoltan is, the man who had already recovered the gate before. But the system is alive! Thank you, you've really proved that free systems can exist in our world.

Bye.

by **zoli** » Wed Feb 27, 2008 8:27 am

Hello,

Good news - after two unsuccessful patches, it seems the kernel issue is finally solved.

The gate has been running for the last week on the latest F8 kernel without any problem.

