Tuesday, 2 April 2013

Using rsync to analyse Hadoop cluster logs at the name node

If you have worked on a Hadoop cluster and tried going through the logs on all the different cluster nodes, you know how painful it can be. The script below is run on the name node and copies all the logs for a given Hadoop job-id into the current directory on the name node. You will have to change the rsync parameters (the node IP addresses and the log path) to suit your own cluster.

Python script:

import sys
import subprocess

if len(sys.argv) < 2:
    print "Usage: python", sys.argv[0], "<hadoop-job-id>"
    sys.exit(1)

# IP addresses of the cluster nodes whose logs we want to collect.
nodes = ['192.168.112.117', '192.168.156.63', '192.168.152.31',
         '192.168.112.118', '192.168.156.65', '192.168.156.62']

job_id = sys.argv[1]

# Pull the userlogs directory for the given job-id from every node
# into the current directory; one rsync process is started per node.
for node in nodes:
    subprocess.Popen(["rsync", "-rav",
                      "root@" + node + ":/data/hadoop_logs/userlogs/" + job_id,
                      "."])
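
Note that subprocess.Popen only starts the rsync processes and returns immediately, so the script can exit before the copies are done. Below is a minimal sketch of a variant that keeps the process handles, waits for every transfer to finish and reports any node where the copy failed; the node IPs and log path are the same placeholders as above and should be changed for your cluster.

import sys
import subprocess

if len(sys.argv) < 2:
    print "Usage: python", sys.argv[0], "<hadoop-job-id>"
    sys.exit(1)

nodes = ['192.168.112.117', '192.168.156.63']  # trim/extend for your cluster
job_id = sys.argv[1]

# Start one rsync per node and remember the Popen handle for each.
procs = []
for node in nodes:
    p = subprocess.Popen(["rsync", "-rav",
                          "root@" + node + ":/data/hadoop_logs/userlogs/" + job_id,
                          "."])
    procs.append((node, p))

# Wait for all transfers to complete and report failures.
for node, p in procs:
    if p.wait() != 0:
        print "rsync from", node, "failed with exit code", p.returncode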

IMP: rsync requires password-less SSH access to these nodes from the node on which you are running this script. For help on setting up password-less access check here.
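
Before running the log-collection script you can quickly check that key-based access actually works from the name node. The sketch below assumes the same placeholder node IPs as above and that your public key has already been copied to each node (for example with ssh-copy-id); it attempts a non-interactive ssh to every node and flags the ones that would still prompt for a password.

import subprocess

nodes = ['192.168.112.117', '192.168.156.63']  # same placeholder IPs as above

for node in nodes:
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # so a non-zero exit code means password-less access is not set up.
    rc = subprocess.call(["ssh", "-o", "BatchMode=yes",
                          "root@" + node, "true"])
    if rc != 0:
        print "password-less access to", node, "is NOT working"
    else:
        print node, "OK"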
