Monitor Bucardo replication lag using Circonus

         I have been using Circonus for monitoring, trending, and alerting on database metrics for quite a long time now. The Circonus interface makes the monitoring, trending, and alerting setup painless, and you can see graphs flowing in minutes. Another good thing about Circonus is that you can monitor anything that you can query from the database! This week, the task at hand was to find a way to monitor Bucardo replication lag. `bucardo_ctl status sync_name` provides very important information that you can rely on for trending and alerting purposes.

$ bucardo_ctl status my_sync
Sync name: my_sync
Current state: WAIT:22s (PID = 19500)
Type: pushdelta
Source herd/database: slave_herd / master_herd
Target database: my_slave
Tables in sync: 318
Last good: 23s (time to run: 1m 21s)
Last good time: Feb 29, 2012 15:27:14 Target: my_slave
Ins/Upd/Del: 142 / 0 / 0
Last bad: 1h 45m 9s (time to run: 19m 57s)
Last bad time: Feb 29, 2012 13:42:29 Target: my_slave
Latest bad reason: MCP removing stale q entry
PID file: /var/run/bucardo/bucardo.ctl.sync.my_sync.pid
PID file created: Wed Feb 29 13:42:33 2012
Status: active
Limitdbs: 0
Priority: 0
Checktime: none
Overdue time: 00:00:00
Expired time: 00:00:00
Stayalive: yes Kidsalive: yes
Rebuild index: 0 Do_listen: no
Ping: yes Makedelta: no
Onetimecopy: 0

All the information provided by the `bucardo_ctl status` command is important, but the most interesting thing to monitor is “Last good:”, which shows the Bucardo replication lag on the slave server.

Trending in Circonus:

Now I have the metric identified for monitoring. The next step is to find the best way to feed the metric into the monitoring tool. After some poking around the output and the available ways to monitor it, I decided to grab the SQL query from the bucardo_ctl Perl script and stick it into Circonus. Most of the time spent setting up this monitor went into extracting the right query from the big Perl script (bucardo_ctl) and mapping it to the required metric. After that, here is the query that I plugged into Circonus in no time:

SELECT 'bucardo_last_good', round(extract(epoch FROM now() - ended))
FROM (SELECT * FROM bucardo.q
      WHERE sync = 'my_sync' AND cdate >= now() - interval '3 days'
      UNION ALL
      SELECT * FROM freezer.master_q
      WHERE sync = 'my_sync' AND cdate >= now() - interval '3 days') AS foo
WHERE ended IS NOT NULL AND aborted IS NULL
ORDER BY ended DESC LIMIT 1;
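
If you just want to eyeball the lag by hand, here is a simplified sketch of the same check. It only looks at bucardo.q; the full query above UNIONs in freezer.master_q because Bucardo periodically archives completed q rows into the freezer schema, so this simplified form can miss recently archived runs:

-- Simplified lag check (sketch): seconds since the last successful sync run
SELECT round(extract(epoch FROM now() - ended)) AS lag_seconds
FROM bucardo.q
WHERE sync = 'my_sync'
  AND ended IS NOT NULL
  AND aborted IS NULL
ORDER BY ended DESC
LIMIT 1;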


Alerting in Circonus:

bucardo_ctl status shows the “Last good” status as “unknown” if replication is broken:

Name       | Type | State    | PID  | Last_good | Time | I/U/D | Last_bad | Time
===========+======+==========+======+===========+======+=======+==========+======
my_sync    | P    | WAIT:35s | 7620 | unknown   |      |       | 36s      | 1m58s

       In Circonus, you can set up rules and relevant severity levels. The most important part is that if the query doesn't return any row, it should page (the “unknown” condition). Circonus provides a rule that alerts when a metric is absent. Now I am all set with alerts as well.

Yay! Bucardo replication is under monitoring and trending without any hassle! Hopefully, this post will help you the next time you want to put Bucardo replication lag under monitoring.

Half a decade with OmniTI, PostgreSQL, FOSS …

By the end of this month, I will be completing 5 years with OmniTI. Before joining OmniTI, I was an Oracle DBA and worked mostly with closed source databases. I am grateful and fortunate that Theo and Robert provided me the opportunity and guidance to work with and contribute to open source technologies, and most importantly an open source database, PostgreSQL! At OmniTI, if you have no hesitation about learning new technology, you will have a smooth learning experience alongside working with the technology you already know for the clients. Most importantly, you have access to technology leaders in the office. Ironically, I had never worked for a single company longer than a year before joining OmniTI, but now I am looking forward to continuing to learn and contribute to the FOSS community while working with OmniTI.

Since joining, I haven't spent a day without learning something new. Are you bored with routine work? Come join OmniTI's adventure; it's where I work!

Bravo to OmniTI, PostgreSQL, and the FOSS community!!

What is pg_extractor?

In my recent blog post, I wrote about the PostgreSQL DBA handyman toolset. getddl is one of the tools on that list. If you are using getddl to extract the DDL schema and track daily changes in SVN for production databases, you should consider moving that process to pg_extractor instead. pg_extractor is a more advanced and robust tool for extracting schema as well as data using pg_dump. Keith Fiske, the author of the tool, described it in detail in his blog post. Thanks to Keith for making the schema extraction tool more robust and taking it to the next level!

Hopefully, it will help you have more control over your database in a smarter way!

MySQL Foreign Key create ERROR: MySQL Error Number 1005

Have you ever come across this error?

MySQL Error Number 1005
ERROR 1005 (HY000): Can't create table './wordpress/#sql-2693_f74e.frm' (errno: 150)

Jason wrote a very good article on his blog summarizing how to track down this error.
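
To give one concrete example (not necessarily from Jason's list, just a common cause I have seen): errno 150 usually means the foreign key definition itself is invalid, for instance a type mismatch between the child column and the referenced column. The table names below are purely illustrative:

CREATE TABLE parent (
    id INT UNSIGNED PRIMARY KEY
) ENGINE=InnoDB;

-- Fails with errno 150: parent_id is signed INT, parent.id is INT UNSIGNED
CREATE TABLE child (
    parent_id INT,
    FOREIGN KEY (parent_id) REFERENCES parent (id)
) ENGINE=InnoDB;

-- Fix: declare parent_id as INT UNSIGNED so the column types match exactly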

It helped me; hopefully these tricks will help you as well!!

Post in the comments if it helps save your day!!

“Shrink” MS SQL Server transaction logfile

          A Microsoft SQL Server transaction log file growing “too large” is a common problem. We need to shrink the transaction log file manually as a one-time operation; after that, setting up a proper recovery policy and/or keeping the mirror server available all the time lets us avoid infinite growth of the transaction log in the future.
          Transaction log file growth depends on which of the following recovery models is used for the database:
 1. Simple
 2. Full
 3. Bulk-Logged

          If your database is in the Simple recovery model, then inactive transactions are truncated from the transaction log after the checkpoint process occurs. The checkpoint process writes all modified data pages from memory to disk; when the checkpoint is performed, the inactive portion of the transaction log is marked as reusable.
          SQL Server 2005 databases are set to the Full recovery model by default. With the Full or Bulk-Logged recovery model, inactive transactions remain in the transaction log file until a checkpoint is processed and a transaction log backup is taken. It is important to note that a full backup does not remove inactive transactions from the transaction log, so if the database recovery model is set to Full or Bulk-Logged, it is absolutely IMPORTANT to take transaction log backups to go along with the full backups. The transaction log backup truncates the inactive portion of the transaction log and allows it to be reused for future transactions. Note that truncating the log does not shrink an already bloated log file; that is the operation we have to do manually one time, and with a proper procedure in place going forward we can avoid unnecessary growth of the log file. Other factors that cause the transaction log to grow are:

  1. Replicated mirrored server falls behind master server
  2. Long running transactions

To solve #1:
          Database mirroring uses transaction logs for replication, in that it requires the transactions to remain in the log until the records have been written to disk on the mirror server. If the mirror server falls behind the master server, the amount of active log space will grow. In this case, you might need to stop database mirroring, take a log backup that truncates the log, apply that log backup to the mirror database, and restart mirroring; see the sketch below.
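
A rough T-SQL sketch of those steps (the database name, backup path, and mirror endpoint are placeholders; treat it as an outline, not a copy-paste script):

-- On the principal: stop mirroring, then take a log backup that truncates the log
ALTER DATABASE PRODUCTION_DB SET PARTNER OFF;
BACKUP LOG PRODUCTION_DB TO DISK = 'D:\backups\PRODUCTION_DB_log.trn';

-- On the mirror: apply that log backup, leaving the database in a restoring state
RESTORE LOG PRODUCTION_DB
    FROM DISK = 'D:\backups\PRODUCTION_DB_log.trn'
    WITH NORECOVERY;

-- Re-establish mirroring (SET PARTNER runs on the mirror first, then the principal)
ALTER DATABASE PRODUCTION_DB SET PARTNER = 'TCP://mirror-host:5022';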

To solve #2 (it's a non-issue for most OLTP servers):
          If there are open transactions, DBCC OPENTRAN will provide the session_id (SPID) of the connection that has the transaction open. You can pass this session_id to sp_who2 to determine which user has the connection open. For example:
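
The SPID 57 below is just a placeholder for whatever DBCC OPENTRAN actually reports:

-- Report the oldest active transaction in the current database (if any)
DBCC OPENTRAN;

-- Suppose it reports SPID 57; see which user and host own that connection
EXEC sp_who2 57;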

              The following queries can be used to determine each database's recovery model and what the transaction log is waiting on before it can be truncated:

SELECT name, recovery_model_desc  FROM sys.databases;

master SIMPLE
tempdb SIMPLE
model FULL
msdb SIMPLE
ReportServer SIMPLE
ReportServerTempDB SIMPLE
PRODUCTION_DB FULL

SELECT name, log_reuse_wait_desc FROM sys.databases;

master ACTIVE_TRANSACTION
tempdb ACTIVE_TRANSACTION
model NOTHING
msdb NOTHING
ReportServer NOTHING
ReportServerTempDB NOTHING
PRODUCTION_DB LOG_BACKUP

          Results from the above queries reveal that the PRODUCTION_DB database uses the Full recovery model and its transaction log will not be truncated until a log backup (LOG_BACKUP). Make sure that you take a full backup followed by a log backup; that makes the transaction log file's inactive blocks available for re-writing and prevents the log from growing further. Keep in mind that it will not shrink the transaction log file. To shrink the transaction log file you can follow the steps described in the official docs, or use the simple example below of identifying and shrinking the log file after taking a log backup.
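First, the backups themselves; the database name and backup paths here are placeholders for your own:

-- A full backup alone does NOT remove inactive transactions from the log
BACKUP DATABASE PRODUCTION_DB TO DISK = 'D:\backups\PRODUCTION_DB_full.bak';

-- The log backup is what truncates the inactive portion of the transaction log
BACKUP LOG PRODUCTION_DB TO DISK = 'D:\backups\PRODUCTION_DB_log.trn';

With the log backed up, find the log file's logical name: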

                SELECT name FROM sys.database_files WHERE type_desc = 'LOG';

         Once I have the log file name, I can use the DBCC SHRINKFILE command to shrink the file. In the following command I try to shrink my log file down to 1 GB (the target size is given in MB):

                 DBCC SHRINKFILE ('PRODUCTION_DB_log.log', 1000);

       Run the shrink command a couple of times during off-peak hours. Also, please make sure that your databases are NOT set to auto-shrink; databases with auto-shrink enabled can run into performance problems.

          Hopefully, this will help you the next time you come across a transaction log file “too large” issue on an MS SQL Server installation.