During the life cycle of a 'live' system, things tend to break.
Sometimes because you've missed something, sometimes because you didn't get to test that bit of code. But whenever a system goes live, after a while (sometimes pretty quickly...) it needs to handle something you haven't tested.
Sometimes it's an unexpected failure. Sometimes it's network lag causing timing issues.
A few of these times, you encounter what I like to call a gremlin. That's the elusive bug which impacts your system and occurs often enough in a day to be considered a problem, but not often enough to be easy to reproduce.
The worst thing programmers love to do is mark it as 'irreproducible', throw it back at QA, and move on to something else.
Advice for this stage: Don't. QA probably won't be able to reproduce it, and even if they can, it will be difficult for them to do so consistently enough for you to learn anything from it. You need to help them. These things escalate fast, so don't put yourself in the position of the developer who tried to sweep the mess under the rug.
Start digging. Sometimes the solution requires wading through endless logs (or even adding those logs first). Sometimes a lot of code is involved. And after a few hours, if you still haven't found it, you find yourself repeating your actions: adding more logs, running the same sequences again. You've entered a stage of tunnel vision, where you can't look around.
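If the digging does come down to logs, a small script can take over some of the repetitive part. Here is a minimal sketch in Python, under a few assumptions of mine: the log format (timestamp, level, request id in brackets, message), the function name `failing_sequences`, and the sample lines are all made up for illustration. It groups lines by request and keeps only the requests that ended up logging an error, so you can read what led up to each failure.

```python
import re
from collections import defaultdict

# Assumed (hypothetical) log format: "<timestamp> <LEVEL> [<request_id>] <message>"
LINE_RE = re.compile(r"^(\S+) (\w+) \[(\w+)\] (.*)$")


def failing_sequences(lines):
    """Group log lines by request id and keep only requests that logged an ERROR."""
    by_request = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip lines that do not follow the assumed format
        timestamp, level, request_id, message = match.groups()
        by_request[request_id].append((timestamp, level, message))
    # Keep the full event sequence of every request that hit at least one ERROR.
    return {
        request_id: events
        for request_id, events in by_request.items()
        if any(level == "ERROR" for _, level, _ in events)
    }


# Made-up sample lines, only to show the shape of the input.
sample = [
    "2024-01-01T10:00:00 INFO [a1] request received",
    "2024-01-01T10:00:01 INFO [b2] request received",
    "2024-01-01T10:00:02 ERROR [a1] upstream timed out",
    "2024-01-01T10:00:03 INFO [b2] request completed",
]

for request_id, events in failing_sequences(sample).items():
    print(request_id)
    for timestamp, level, message in events:
        print("  ", timestamp, level, message)
```

Comparing a few of these failing sequences side by side is often a quicker way to spot a pattern than scrolling through the raw file yet again.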
Best advice once tunnel vision sets in: Stop. Go get coffee, go home, call someone, take a break. My best solutions came from getting some distance from the problem I'd been working on, which cleared my vision and let me try a new approach. A fresh set of eyes can work as well.
Happy hunting.