GamerzCrib's Forums

02/23/2008 3:50 pm CyberMage +1
Our first real glitch
Ugh, I don't know why this didn't show up during the weeks before I moved the servers to the datacenter, but the second server is now throwing hard drive errors. It remounted its file system as read-only, which kept its database from syncing with the master database. That was the cause of the weird session problems over the last hour.

I've taken that server out of the loop temporarily until I can figure out what to do with it.
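
For anyone who wants to catch a read-only remount like this one early, here is a minimal sketch. It assumes a Linux host where the mount table can be read from /proc/mounts (the thread does not say which OS these servers run), and the function names and the sample mount points are made up for illustration.

```python
def read_only_mounts(mounts_text):
    """Return the mount points whose option list includes 'ro'.

    Expects text in the /proc/mounts layout:
    device mount_point fs_type options dump pass
    """
    read_only = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point = fields[1]
        options = fields[3].split(",")
        # Match the exact 'ro' option, not substrings such as
        # 'errors=remount-ro', which only describes the error policy.
        if "ro" in options:
            read_only.append(mount_point)
    return read_only


def check_local_mounts(path="/proc/mounts"):
    # Assumption: /proc/mounts exists (Linux only).
    with open(path) as handle:
        return read_only_mounts(handle.read())
```

Some pseudo file systems are mounted read-only on purpose, so a real check would probably compare the result against a short list of mount points that are expected to be writable.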
----
This is my sig. You wish your sig was this cool.
02/23/2008 5:50 pm CyberMage +1
RE: Our first real glitch
Well, it's definitely an incompatibility between the motherboards and the Raptor hard drives. The one server using standard 7200 RPM drives isn't having a problem.

I've restarted the downed server and set it to continue replication but not be used for anything - so no queries will actually hit that server. This way at least I'm still keeping the database in sync.

I also took the third server, activated its database server, and started replication on it (those who saw the "site down" notice happened to visit during the 2 minutes while I was linking the replication). It's now taking the brunt of the database requests.
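
The "keep replicating but take no queries" setup can be pictured roughly like this. It is only a sketch under assumptions: the `Replica` class, the `serving` flag, and the server names are hypothetical, and the real site may route its queries quite differently.

```python
import random
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    replicating: bool = True  # still pulling changes from the master
    serving: bool = True      # allowed to answer read queries


def pick_read_server(replicas, rng=random):
    """Choose a replica for a read query, or None if none is eligible.

    A replica that replicates but is not serving stays in sync
    without ever being handed a query.
    """
    candidates = [r for r in replicas if r.replicating and r.serving]
    if not candidates:
        return None  # caller would fall back to the master
    return rng.choice(candidates)


# Hypothetical layout matching the post: the flaky server only
# replicates, the newly activated one takes the reads.
replicas = [
    Replica("server2", serving=False),
    Replica("server3"),
]
chosen = pick_read_server(replicas)
```

Keeping the two flags separate means the flaky server can be put back into rotation later by flipping `serving`, with no need to resync it first.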

Since two-thirds of my servers are affected by this hardware glitch, I expect a few days of rough weather. As long as the master database server stays up we'll be in good shape, but I'm seeing signs in its logs that we might have a problem there.

Anyway, that's the state of the union after around 18 hours online. Just my luck. I didn't really see these problems until the servers started taking some significant activity - we're averaging around 6 to 10 users on the server at any given time right now.
02/24/2008 2:42 pm CyberMage +0
RE: Our first real glitch
I just uploaded a special utility to the servers. Every five minutes the primary server will test the slave databases and take any that have gone down out of the loop. It will also page me when this happens.

So if a slave database dies again without taking down the entire slave machine (which would already notify me), then at worst the site will be unstable for up to 5 minutes until the watchdog notices and blacklists the server.

Later I'll set that on a 1 minute cycle, but for now I piggy-backed it onto one of the other 5 minute processes I'm already running.
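
For the curious, one pass of a watchdog like this could look roughly like the sketch below. It is not the actual utility: `run_watchdog`, `is_healthy`, and `notify` are hypothetical names, and the health check and the paging are passed in as plain functions so the sketch stays independent of any particular database or pager.

```python
def run_watchdog(replicas, is_healthy, blacklist, notify):
    """Run one watchdog pass and return the replicas newly marked down.

    replicas   -- iterable of replica names to test
    is_healthy -- function(name) -> bool, e.g. a trivial test query
    blacklist  -- set of names already out of the loop (updated in place)
    notify     -- function(list_of_names), e.g. something that sends a page
    """
    newly_down = []
    for name in replicas:
        if name in blacklist:
            continue  # already out of the loop, do not page again
        try:
            healthy = is_healthy(name)
        except Exception:
            healthy = False  # a check that raises counts as a failure
        if not healthy:
            blacklist.add(name)
            newly_down.append(name)
    if newly_down:
        notify(newly_down)
    return newly_down
```

Run from a scheduler on a fixed interval, a pass like this finds a dead replica no later than one interval after it fails, which is where the "up to 5 minutes" figure comes from. Moving to a 1 minute cycle would shrink that window to about a minute.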

Website Copyright 2008 DoUHearMe.com, Inc.
User provided content remains the responsibility of the poster.