Out Of Memory

When a job is terminated because an application process running in the job attempted to use more memory than was requested for the job, you will see one or more 'OOM', 'oom-kill event(s)', or 'out of memory' errors in the job log.

To resolve the issue, you have the following choices and considerations:

  • Is the goal to finish the analysis with the same dataset, application, and parameters as soon as possible?

If yes, then you need to request more memory when resubmitting the job. Keep increasing the memory request and resubmitting the job until the analysis completes. One rule of thumb some people use is to double the amount of requested memory and then tighten the memory request on subsequent runs. That can be done based on what you see in the jobhtop interface or by running a memory-use profiler (top -bp PID, remora, nsight, ddt, or some other debugger/profiler).
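For example, a resubmission with a doubled memory request only needs the --mem value in the job script changed. A minimal sketch, assuming a simple single-task SLURM batch script; the memory values, module name, and application command are placeholders, not a specific recommendation:

  #!/bin/bash
  #SBATCH --job-name=my_analysis
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=4
  #SBATCH --mem=32gb          # doubled from the 16gb request that hit the OOM limit
  #SBATCH --time=24:00:00

  module load my_app          # 'my_app' stands in for your application
  my_app --input data.file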

Use the bigmem partition once the 1 TB memory request boundary is crossed, as our regular nodes have 1 TB of RAM. Use the burst qos to gain access to more memory if the job runs out of resources in your investment qos, and ask your HPG group's sponsor to purchase more resources if neither is sufficient. There is a 1-month investment term under which 1 NCU (1 CPU core + 7.8 GB of memory) costs only $3.67 for the month. Sometimes it is far cheaper to throw more resources at a problem than to spend time looking for a different solution.
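A sketch of the relevant job script directives for that case, assuming your group is called 'mygroup' (the burst qos is typically the group name with a '-b' suffix; check your group's actual qos names):

  #SBATCH --partition=bigmem
  #SBATCH --mem=1500gb          # above the 1 TB limit of the regular nodes
  #SBATCH --qos=mygroup-b       # burst qos; omit to stay in the investment qos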

If an application uses more memory than you would expect from that application, check with the developers. It is possible that a particular release has a memory leak or another bug that causes this issue. Look for the project's GitHub repository or email list/forum, or contact customer support if the application is commercial.

  • Is the goal to finish the analysis with the same application, but you can either change the parameters (arguments) the application runs with or split your dataset into smaller subsets so that the analyses use less memory? If so, consider those options. Being able to split a dataset is a bit of a magic button resource-wise if the analysis is amenable to it (see the sketch after this list).
  • Is the goal to finish the analysis, but you don't necessarily have to use the same application you used for other datasets, or would you prefer to rerun analyses on other datasets to have a single reproducible workflow? Look at other applications that use less memory. Most workflows have several application options that achieve the same result.
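As an illustration of the dataset-splitting approach mentioned above, a large line-oriented input file can be broken into chunks and processed as a SLURM job array, with each array task requesting only a fraction of the memory the whole dataset would need. This is only a sketch; 'big_input.txt' and 'my_analysis' are placeholders, and whether splitting is valid depends on your analysis:

  # On a login node: split the input into 10 chunks without breaking lines (GNU split)
  split --numeric-suffixes=0 -n l/10 big_input.txt chunk_

  # array_job.sh, submitted with 'sbatch array_job.sh':
  #!/bin/bash
  #SBATCH --mem=8gb            # per-task request, much smaller than the full dataset needs
  #SBATCH --array=0-9          # one task per chunk (chunk_00 ... chunk_09)
  my_analysis --input chunk_0${SLURM_ARRAY_TASK_ID}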