Data Collection for Root Cause Analysis
When troubleshooting a Production Down issue, administrators who want to investigate the root cause should collect the information below before restarting the environment, because a restart discards the state of the running process (threads, heap contents, open files, and network connections).
Even with all of this information, it may not be possible to determine a root cause.
For convenience, you can define an alias that prints the process ID (PID) of the running Datameer Java process; several of the commands below use it:
alias dmpid='ps -ef | grep -i "java.*jetty.*datameer" | grep -v grep | tr -s " " | cut -d " " -f2'
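If the pattern matches more than one process (for example, a second Datameer instance on the same host), $(dmpid) expands to several PIDs and commands such as jmap or jstack fail. The following is a minimal sketch, assuming bash, of two hypothetical helper functions (dm_pids_from_ps and dm_pid) that use the same pattern but refuse to continue unless exactly one process matches.

```shell
# Filter `ps -ef`-style lines read from stdin down to the PIDs of Datameer JVMs.
# Same pattern as the dmpid alias; awk prints the second column (the PID).
dm_pids_from_ps() {
  grep -i "java.*jetty.*datameer" | grep -v grep | awk '{print $2}'
}

# Print the Datameer PID, or fail with a message unless exactly one process matched.
dm_pid() {
  local pids count
  pids=$(ps -ef | dm_pids_from_ps)
  # grep -c . counts non-empty lines, i.e. the number of PIDs found
  count=$(printf '%s\n' "$pids" | grep -c .)
  if [ "$count" -ne 1 ]; then
    echo "expected exactly 1 Datameer process, found $count" >&2
    return 1
  fi
  printf '%s\n' "$pids"
}
```

For example, PID=$(dm_pid) && $JAVA_HOME/bin/jstack -l "$PID" > jstack.out stops early instead of passing an empty or multi-PID argument to jstack.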
Items to Collect before Restarting Datameer
- Collect all logs from the logs/ directory, at a minimum those updated in the past day or two, and especially the application log file:
tar -zcvf rca.tar.gz logs/
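The tar command above archives the matching files regardless of their age. To limit the archive to files modified within a given number of days, a minimal sketch using find could look like the following; the function name collect_recent_logs is hypothetical, and the sketch assumes bash plus GNU find and GNU tar (for --null and -T -).

```shell
# Archive every regular file under LOGDIR modified less than DAYS*24 hours ago.
# find -print0 and tar --null -T - pass the file names NUL-separated, so names
# containing spaces or newlines are handled safely.
collect_recent_logs() {
  local logdir=$1 days=$2 archive=$3
  find "$logdir" -type f -mtime "-$days" -print0 |
    tar -czvf "$archive" --null -T -
}
```

For example, run collect_recent_logs logs 2 rca-logs.tar.gz from the Datameer installation directory to collect the logs touched in roughly the last 48 hours.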
- To list the open files (lsof), run the following on the Datameer host:
lsof > lsof.out
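A full lsof listing can be very large. When the symptoms point to file-descriptor exhaustion (for example "Too many open files" errors), it can help to compare the number of descriptors the Datameer process holds with its limit. A minimal Linux-only sketch that reads /proc; the function name fd_usage is hypothetical, and it has to run as the process owner or root to read /proc/<pid>/fd.

```shell
# Report how many file descriptors a process holds and its soft "Max open files" limit.
# Linux only: /proc/<pid>/fd has one entry per open descriptor, and the fourth
# whitespace-separated field of the "Max open files" line in /proc/<pid>/limits
# is the soft limit.
fd_usage() {
  local pid=$1 used limit
  used=$(ls "/proc/$pid/fd" | wc -l)
  limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
  echo "PID $pid: $used open file descriptors (soft limit: $limit)"
}
```

For example, fd_usage "$(dmpid)" prints a single summary line that can be saved alongside lsof.out.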
- Additionally, gather the TCP connections that are in a WAIT state (for example TIME_WAIT or CLOSE_WAIT):
netstat -tonp | grep -i WAIT > netstat.out
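To see at a glance which WAIT state dominates, the capture can be summarized per state. A minimal sketch, assuming the netstat.out format produced by the command above, where the state is the sixth column of each tcp/tcp6 line; the function name summarize_states is hypothetical.

```shell
# Count connections per TCP state in a `netstat -tonp` capture, most frequent first.
# Column 6 of each tcp/tcp6 line holds the state (e.g. CLOSE_WAIT, TIME_WAIT).
summarize_states() {
  awk '$1 ~ /^tcp/ {print $6}' "$1" | sort | uniq -c | sort -rn
}
```

For example, summarize_states netstat.out. A large and growing number of CLOSE_WAIT entries usually means the local application has not closed sockets that the remote side already closed.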
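- Force a heap dump of the Datameer JVM. The resulting file can be roughly as large as the used heap, so make sure the current directory has enough free disk space: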
$JAVA_HOME/bin/jmap -F -dump:format=b,file=heapdump.hprof $(dmpid)
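The jmap -F (force) option is intended for a process that does not respond to a normal dump request, and the forced mode can be considerably slower. One option is to try a normal dump first and fall back to -F only if that fails. A minimal sketch; the function name heap_dump is hypothetical, and it assumes a JDK 8 (or earlier) jmap, since newer JDKs no longer support jmap -F (jhsdb jmap is the replacement there).

```shell
# Write a binary heap dump of PID to FILE: first via the normal attach mechanism,
# and only if that fails, via the forced (-F) mode used in the command above.
heap_dump() {
  local pid=$1 file=$2
  "$JAVA_HOME/bin/jmap" -dump:format=b,file="$file" "$pid" && return 0
  # Remove any partial file left by the failed attempt before retrying.
  rm -f "$file"
  "$JAVA_HOME/bin/jmap" -F -dump:format=b,file="$file" "$pid"
}
```

For example, heap_dump "$(dmpid)" heapdump.hprof.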
- If the /dev page is accessible, collect a thread dump from it (i.e. http(s)://<host>:<port>/dev/threaddump).
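To capture the /dev thread dump from the command line (useful when taking several dumps a few seconds apart), a minimal sketch with curl; the function name fetch_thread_dump is hypothetical. Add -k if the instance uses a self-signed certificate, and -u <user>:<password> if the page requires a login; whether it does depends on your installation.

```shell
# Download a thread dump page to a timestamped file and print the file name.
# -f makes curl fail on HTTP errors instead of saving an error page.
fetch_thread_dump() {
  local url=$1 out
  out="threaddump-$(date +%Y%m%d-%H%M%S).txt"
  curl -fsS -o "$out" "$url" && echo "$out"
}
```

For example, fetch_thread_dump "https://<host>:<port>/dev/threaddump".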
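- Also collect a Java thread dump with jstack, if possible:
$JAVA_HOME/bin/jstack -l $(dmpid) > jstack.out
If jstack cannot attach to the process, send signal 3 (SIGQUIT) instead; the JVM then writes the thread dump to its standard output, which is usually redirected to a log file:
kill -3 $(dmpid)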
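A single thread dump shows only one moment. Taking several dumps a few seconds apart makes it easier to tell threads that are stuck from threads that are merely busy. A minimal sketch, assuming bash; the function name thread_dumps and the output file naming are hypothetical.

```shell
# Take COUNT jstack dumps of PID, INTERVAL seconds apart, into jstack-<pid>-<n>.out.
thread_dumps() {
  local pid=$1 count=$2 interval=$3 i
  for i in $(seq 1 "$count"); do
    "$JAVA_HOME/bin/jstack" -l "$pid" > "jstack-$pid-$i.out" || return 1
    # No need to wait after the last dump.
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
  done
}
```

For example, thread_dumps "$(dmpid)" 3 10 writes three dumps taken ten seconds apart.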
- If a heap dump from an OutOfMemoryError exists, collect the heap dump file.
- If a javacore file exists (IBM JVMs write these), collect it as well.
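If it is not clear where such files were written, searching the Datameer installation directory (or any directory configured with -XX:HeapDumpPath) can locate them. HotSpot names automatic heap dumps java_pid<pid>.hprof by default, and IBM javacore files start with javacore. and end in .txt, so the sketch below matches on those patterns. The function name find_dump_files is hypothetical.

```shell
# List heap dumps (*.hprof) and IBM javacore files under DIR (default: current
# directory) with size and modification time, so the relevant ones can be picked.
find_dump_files() {
  find "${1:-.}" -type f \( -name '*.hprof' -o -name 'javacore*.txt' \) \
    -exec ls -lh {} +
}
```

For example, find_dump_files /path/to/datameer, where the path is your installation directory.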
- Gather an application database dump, if possible.
mysqldump -h<host> -udap -pdap dap | gzip > Datameer-<version>-<dist>-<date>.sql.gz
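In the mysqldump pipeline above, the shell reports only the exit status of gzip, so a failed mysqldump can still leave a small, valid-looking .sql.gz file. A minimal bash sketch that enables pipefail, adds --single-transaction (a consistent snapshot without locking tables, useful when the tables use InnoDB), and verifies the archive with gzip -t. The function name dump_app_db and the file naming are hypothetical, and the connection values are passed in by the caller.

```shell
# Dump database DB from HOST as USER to Datameer-db-<db>-<yyyymmdd>.sql.gz and
# print the file name. The body runs in a subshell so pipefail does not leak out.
dump_app_db() (
  set -o pipefail
  host=$1 user=$2 pass=$3 db=$4
  out="Datameer-db-$db-$(date +%Y%m%d).sql.gz"
  mysqldump -h"$host" -u"$user" -p"$pass" --single-transaction "$db" | gzip > "$out" &&
    gzip -t "$out" &&
    echo "$out"
)
```

For example, dump_app_db <host> dap dap dap returns a non-zero status if either mysqldump or gzip fails.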
Even when all of the information above is collected, diagnosing the root cause may require further troubleshooting. If an issue recurs, Datameer Support may recommend activating Memory Profiling as described in the Datameer documentation. Memory Profiling is not recommended in general and should be activated only at the request of Datameer Support.