Earlier this week, we got an email from DonorsChoose.org with the need of “a serious Postgres expert” to solve the problem they have been experiencing and blocker for number of projects at hand. They heard good things about OmniTI from technology clubs and communities in NYC.
DonorsChoose.org is an online charity that makes it easy for anyone to help students in need. Public school teachers from every corner of America post classroom project requests on their site, and you can give any amount to the project that most inspires you.
This year, in beginning of July , they migrated Postgres database server from virtual hardware to high capacity bare-metal server and upgraded their databases from Postgres 8.2 to Postgres 9.2. As everyone hope after upgrade, website was much faster in response time and they should be happy after upgrading their database to Postgres 9.2. That is the case for them as well Yes, they are happy Postgres user except some of the queries used to run without any issue are causing Out of Memory errors now ! Sometimes, the queries were causing segmentation fault by Signal 11
Weird, right ?
Here is the email received that describes the problem:
We’ve been happy Pg users for years now and have a pretty good command of what’s going on. We recently upgraded to 9.2.x and moved onto new hardware at the same time. Everything’s screaming fast as we’d hoped and working well, but… Now our most-intensive queries are failing with an “out of memorySQL state: 53200″ error. Not in production mind you, these are long-running queries we execute manually against a slave to infrequently do big exports.
On our old Pg version 8.2.x and much skimpier hardware, the job would take forever but complete, which was fine for the purpose. Now it’s on newer software with much more memory and CPU, but failing to complete. It seems to be failing on reports that use temporary tables that weren’t analyzed, and large queries during “hash join” and “merge join” operations. We’ve surely got something configured wrong, but we’ve been banging our heads against the wall and are out of ideas, eg. we’ve tried cranking work_mem way up, disabling hashjoin, no dice.
We requested to have conference call to get more details but we couldn’t able to attend conference call next day because of next day scheduled visit to NYC . When we mentioned that we are visiting NYC tomorrow, they requested us, if we could stop by their Office ! We said yes because their office is on the way to OmniTI’s NYC office
In late evening, a day before visit, we sent out email to get some background and more details about environment so we could come up with some ideas before meeting:
1. OS version : uname -a
2. Postgres Version(please provide output of this query): select version();
3. Total Memory on server : free -m
4. Actual error with query from pg_log. If you could provide log file with contents 1 hour before and after actual error, that will be helpful.
5. postgresql.conf file(s) for pre and post upgrade. Postgrs 8.2 and Postgres 9.2
6. Output of show commands in production : show all;
7. Explain output of query on pg9.2 . If you have explain output from earlier Postgres 8.2 version available, that will be helpful as well.
While riding on the train next day, we received answers on the questions via email.
Excerpts from log file…
2013-10-08 18:24:43 EDT : [4-1] user=XXX,db=dc_query ERROR: out of memory
2013-10-08 18:24:43 EDT : [5-1] user=XXX,db=dc_query DETAIL: Failed on request of size 24.
2013-10-08 19:21:12 EDT : [2527-1] user=,db= LOG: server process (PID 17415) was terminated by signal 11: Segmentation fault
2013-10-08 19:21:12 EDT : [2528-1] user=,db= DETAIL: Failed process was running:
We analyzed answers and it looked like the problem characteristics kind of matches with the similar situation that we came across for other client.
- Upgraded database from Postgres 8.2 to Postgres 9.2
- Query is failing with Out of Memory
- Explain plan is damn big ! The query is doing lots of joins !
The only but big difference was that we did not come across segmentation fault. We discussed about the problem internally in the team. Everyone kind of agree that it’s issue with max_locks_per_transaction. max_locks_per_transaction controls the average number of object locks allocated for each transaction; individual transactions can lock more objects as long as the locks of all transactions fit in the lock table. By default, max_locks_per_transaction is set to 64, which means that Postgres is prepared to track up to ( 64 X number of open transactions) locks. The problem is that Postgres 8.2 with same setting query runs without any issue but query fails with Out of Memory error on Postgres 9.2. Not good, right?
After morning meetings in NYC, we visited DonorsChoose office and as we planned to test the first thing is to change max_locks_per_transaction parameter with higher value and restart database to reload the settings, we just gave it a try on QA database server.
You know what, the change solved the problem !
They couldn’t able to run query since they upgraded to Postgres 9.2, but now they could While running the query, we noticed that the query required around 150 locks but the default value is only 64. The query is joining 50+ tables because it is data-warehosue query and having couple of unions!
You must be thinking that why the query worked on Postgres 8.2 but it doesn’t work on Postgres 9.2 ? My theory behind is that max_locks_per_transaction value wasn’t being honored until the implementation of true serialization isolation level and inclusion of max_pred_locks_per_transaction parameter in Postgres 9.1. If you know any concrete reason, I will be happy to learn more about it.
In any case, if you come across similar problem of “Out of Memory” error after upgrading your database server to Postgres 9, you know the first thing to check is max_locks_per_transaction!