Monday, 31 December 2012

Adding GC logs to hadoop child processes and analyzing the GC logs

Add the following to the mapred-site.xml file: Java opts for the task tracker child processes. The following symbol, if present, will be interpolated: @taskid@ is replaced by the current TaskID. Any other occurrences of '@' will go unchanged. For example, to enable verbose GC logging to a file named for the taskid in /tmp and to set the heap maximum to a gigabyte, pass a 'value' of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc
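
To make the @taskid@ interpolation concrete, here is a small Python sketch of the substitution described above. It only illustrates the behaviour and is not Hadoop's own code; the task ID used is a made-up example value.

```python
# Illustration only: mimics the documented @taskid@ substitution.
# This is not Hadoop's implementation, and the task ID below is hypothetical.

def interpolate_task_id(java_opts, task_id):
    """Replace every @taskid@ token with the given task ID."""
    return java_opts.replace("@taskid@", task_id)


child_opts = "-Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc"
example_task_id = "attempt_201212310000_0001_m_000000_0"  # hypothetical value

resolved_opts = interpolate_task_id(child_opts, example_task_id)
print(resolved_opts)
```

With these options each child process would write its GC log to its own file under /tmp, named after the task it ran.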

Additional options: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC


Analyzing GC logs:

The [GC [PSYoungGen: 230400K->19135K(268800K)] line means:

  • The Young Generation capacity is about 262.5MB (268800K),
  • Before the young generation collection, about 225MB (230400K) of the Young Generation was in use, and
  • After the collection, usage dropped to about 18.7MB (19135K).
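
The same arithmetic can be done for any such line with a short Python sketch. The regular expression below is my own assumption about the "name: beforeK->afterK(capacityK)" layout shown above and is not an official log grammar.

```python
import re

# Assumed layout of one generation entry, e.g.
# "PSYoungGen: 230400K->19135K(268800K)". Not an official grammar.
GEN_PATTERN = re.compile(r"(\w+): (\d+)K->(\d+)K\((\d+)K\)")


def parse_generation(line):
    """Return (name, before_mb, after_mb, capacity_mb) for the first entry."""
    match = GEN_PATTERN.search(line)
    if match is None:
        raise ValueError("no generation entry found in: " + line)
    name = match.group(1)
    before_kb, after_kb, capacity_kb = (int(g) for g in match.group(2, 3, 4))
    # The log reports kilobytes; divide by 1024 to get megabytes.
    return name, before_kb / 1024, after_kb / 1024, capacity_kb / 1024


sample = "[GC [PSYoungGen: 230400K->19135K(268800K)]"
name, before, after, capacity = parse_generation(sample)
print(f"{name}: {before:.1f} MB -> {after:.1f} MB (capacity {capacity:.1f} MB)")
# prints "PSYoungGen: 225.0 MB -> 18.7 MB (capacity 262.5 MB)"
```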


Sunday, 30 December 2012

Playing around with Solr

The following article shows how to add a custom field (for ISBN numbers) to be indexed by Solr:

Would love to write a custom one myself soon!

Thursday, 27 December 2012

Inspiring article on Big Data

"By the sheer power of its will (and ingenuity), a small team has been able to craft a large custom platform out of Hadoop, NoSQL databases and other open-source technologies."

Saturday, 22 December 2012

How to download the historical versions of pages from google cache

You can get historical versions of web pages from the Google cache. I am not sure how old these copies are; if any of you knows or cares to look it up, please let me know too :P

The query you need to use to get an older version of webpage from Google Cache is as follows:

To download the page using the wget Linux command, you need to pass a user-agent, as Google doesn't allow wget to fetch data otherwise. Here is how you can automate it:

wget --output-document=out.htm --user-agent=AGENT --level=1

Interesting isn't it :)
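
The same idea (sending an explicit User-Agent header) can be sketched in Python with the standard urllib module. The URL and agent string below are hypothetical placeholders, not the actual cache query, so substitute your own.

```python
import urllib.request

# Hypothetical placeholders: substitute the real query URL and agent string.
PAGE_URL = "http://example.com/"
USER_AGENT = "Mozilla/5.0"


def build_request(url, user_agent):
    """Create a request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, user_agent, out_path):
    """Fetch the page and write the raw bytes to out_path."""
    with urllib.request.urlopen(build_request(url, user_agent)) as response:
        data = response.read()
    with open(out_path, "wb") as out_file:
        out_file.write(data)


# Example call (performs a network request, so it is left commented out):
# download(PAGE_URL, USER_AGENT, "out.htm")
```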

Tuesday, 18 December 2012

Using awk to find the sum of a CSV file after group by on a particular column

I have a CSV file like:

65523 , 100
65522 , 900
65522 , 1800
65522 , 100
65522 , 100
65521 , 500
65521 , 200
65521 , 200

I need to find the sum of the 2nd column based on the grouping by the 1st column, so that the output looks something like:

65523 , 100
65522 , 2900
65521 , 900


This can be easily achieved using a single-line awk script (awk does not guarantee the order in which the groups are printed):

awk -F',' '{a[$1]+=$2} END{for (i in a) print i ", " a[i]}' file

Awesome isn't it !! :)
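
For comparison, here is a rough Python sketch of the same group-by-and-sum. The function name sum_by_key is my own, and the sample data is the file shown above.

```python
import csv

SAMPLE = """65523 , 100
65522 , 900
65522 , 1800
65522 , 100
65522 , 100
65521 , 500
65521 , 200
65521 , 200
"""


def sum_by_key(lines):
    """Sum the 2nd column for each distinct value of the 1st column."""
    totals = {}
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        key = row[0].strip()
        totals[key] = totals.get(key, 0) + int(row[1])
    return totals


totals = sum_by_key(SAMPLE.splitlines())
for key, total in totals.items():
    print(f"{key} , {total}")
```

Unlike the awk loop, a Python dict keeps the keys in first-seen order, so the groups come out in the order they appear in the file.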

Sorting a CSV file based on a particular column list

sort --field-separator=';' --key=2,1
sort -nr -t',' -k3

To sort based on multiple columns use the syntax:

sort --key=1,1 --key=2,2r --key=3,3 --key=4,4r
sort -k1,1 -k2,2r -k3,3 -k4,4r
as in the following transcript:
pax$ echo '5 3 2 9
3 4 1 7
5 2 3 1
6 1 3 6
1 2 4 5
3 1 2 3
5 2 2 3' | sort --key=1,1 --key=2,2r --key=3,3 --key=4,4r

1 2 4 5
3 4 1 7
3 1 2 3
5 3 2 9
5 2 2 3
5 2 3 1
6 1 3 6
Remember to provide the -n option if you want them treated as proper numbers (variable length), such as:
sort -n -k1,1 -k2,2r -k3,3 -k4,4r
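
The multi-column ordering in the transcript above (column 1 ascending, column 2 descending, column 3 ascending, column 4 descending) can be cross-checked with a short Python sketch. It treats the fields as integers, which corresponds to the -n variant.

```python
ROWS_TEXT = """5 3 2 9
3 4 1 7
5 2 3 1
6 1 3 6
1 2 4 5
3 1 2 3
5 2 2 3"""

# Parse each line into a tuple of integers.
rows = [tuple(int(field) for field in line.split())
        for line in ROWS_TEXT.splitlines()]

# Negating a numeric field reverses its order, like the "r" flag on a sort key.
ordered = sorted(rows, key=lambda r: (r[0], -r[1], r[2], -r[3]))

for row in ordered:
    print(" ".join(str(field) for field in row))
```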

Learning to use Screen command in Unix

If your local computer crashes, or you are connected via a modem and lose the connection, the processes or login sessions you establish through screen don't go away. You can resume your screen sessions with the following command: screen -r

screen                       ==>    Start a new screen session
(Ctrl+A) & c                 ==>    Start a new window in the current session
(Ctrl+A) & k                 ==>    Kill the current window
(Ctrl+A) & (Shift + ")       ==>    Show the list of windows in the current session
screen -r                    ==>    Reattach to a detached screen session
screen -ls                   ==>    List the running screen sessions
(Ctrl+A) & (Shift + A)       ==>    Rename the current window

Sunday, 16 December 2012

Setting up SVN on AWS EC2 instance

Steps to set up the SVN repository on the cloud instance:
Install subversion, apache and mod_dav_svn:
# sudo yum install mod_dav_svn subversion
Edit the Apache configuration file for subversion:
# sudo vi /etc/httpd/conf.d/subversion.conf
Replace subversion.conf content by:

LoadModule dav_svn_module     modules/
LoadModule authz_svn_module   modules/
<Location /repos>
   DAV svn
   SVNParentPath /var/www/svn
   # Limit write permission to list of valid users.
   AuthType Basic
   AuthName "Authorization Realm"
   AuthUserFile /var/www/svn-auth/passwd
   AuthzSVNAccessFile  /var/www/svn-auth/access
   Require valid-user
</Location>

Create the directory which will contain the subversion repository:
# sudo mkdir /var/www/svn
Create the directory which will contain the permissions files.
# sudo mkdir /var/www/svn-auth
Create the permission file:
# sudo vi /var/www/svn-auth/access
[/]
<theUser> = rw
Note: Replace <theUser> by the login you want to use to access your repository.
The [/] section applies the rule to every path, so <theUser> will have read/write access to all repositories.
It is possible to set up authorization by group or user for each repository.
Create the password file:
# sudo htpasswd -cb /var/www/svn-auth/passwd <theUser> <thePassword>
Note: Replace <theUser> by the login you want to use to access your repository. Replace <thePassword> by the password you want.
Create a repository (here project1):
# cd /var/www/svn
# sudo svnadmin create project1
Change files authorization:
# sudo chown -R apache:apache /var/www/svn /var/www/svn-auth
# sudo chmod 600 /var/www/svn-auth/access /var/www/svn-auth/passwd
Start apache web server:
# sudo service httpd start
Note: to restart server use # sudo service httpd restart

Test subversion

Now subversion and apache should work.
Open a web browser and point to the URL : http://<Public DNS of your EC2 instance>/repos/project1
You should be prompted for your credentials (enter <theUser> and <thePassword>) before accessing the repository.

Subversion client

We are now going to interact with our repository from a Windows PC.
If you don’t have a subversion client installed on your PC, install one first.
You can test your subversion client from your PC by listing the files in your repository:
svn ls http://<Public DNS of your EC2 instance>/repos/project1
The first time, we often want to import some files into the repository:
svn import -m "Initial import." <path of the directory on your PC that contains the files> http://<Public DNS of your EC2 instance>/repos/project1


Steps given at

I signed up for an EC2 free tier tonight intending to use it as a Subversion server. The whole process took a couple of hours to set up, but can be compressed down to just a few minutes. If anybody else is looking to do the same, here is what I did:

1) Create the Linux-flavored micro instance.
2) Give it a Security Group that opens port 3690 to the sources of your choice. The following example allows SVN access from all Internet sources:
  • | tcp | 3690 | 3690 |
  • | udp | 3690 | 3690 |
3) SSH into the micro instance and run the command "sudo yum install subversion".
4) Create the directory to be used for subversion, for example "/home/ec2-user/svn".
5) Enter "svnadmin create /home/ec2-user/svn" (or whichever path you specified).
6) Edit two files: svnserve.conf and passwd both found in the "conf" directory of your newly created repository source.
In svnserve.conf, uncomment the lines for "anon-access" (but change the value from "read" to "none") and "auth-access". Additionally, uncomment the "password-db = passwd" line without changing it, and uncomment the "realm =" line, providing a realm name of your choice.
In the passwd file, simply add a line entry with the user name(s) and password(s) to be used for access.
7) Finally, run "svnserve -d -r /home/ec2-user/svn" (or again, whichever path you specified) to kick off the instance.

Your Subversion server should now be available at the URL: svn://<yourinstancehostorip>/

Keep in mind the security for this quick and dirty setup is very minimal, and the daemon has not been configured for automatic startup at this point.


Saturday, 15 December 2012

Setting up Nutch on MAC - issues [Solved]

I set up Nutch following the official tutorial:

However, I am facing this error when I try to run the command "bin/nutch crawl urls -dir crawl -depth 3 -topN 5":

2012-12-15 14:43:13.028 java[82636:c07] *** Terminating app due to uncaught exception 'JavaNativeException', reason: 'KrbException: Could not load configuration from SCDynamicStore'
*** First throw call stack:
0   CoreFoundation                      0x00007fff8b94cfc6 __exceptionPreprocess + 198
1   libobjc.A.dylib                     0x00007fff87621d5e objc_exception_throw + 43
2   CoreFoundation                      0x00007fff8b9d72a9 -[NSException raise] + 9
3   JavaNativeFoundation                0x0000000108aa1c3f JNFCallStaticVoidMethod + 213
4   libjava.jnilib                      0x0000000108ac1169 Java_sun_security_krb5_SCDynamicStoreConfig_installNotificationCallback + 450
5   JavaNativeFoundation                0x0000000108aa4182 JNFPerformEnvBlock + 86
6   SystemConfiguration                 0x00007fff8703d3b8 rlsPerform + 119
7   CoreFoundation                      0x00007fff8b8bb6e1 __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ + 17
8   CoreFoundation                      0x00007fff8b8baf4d __CFRunLoopDoSources0 + 253
9   CoreFoundation                      0x00007fff8b8e1d39 __CFRunLoopRun + 905
10  CoreFoundation                      0x00007fff8b8e1676 CFRunLoopRunSpecific + 230
11  java                                0x00000001081c0843 java + 18499
12  java                                0x00000001081c029a java + 17050
13  java                                0x00000001081bda98 java + 6808
terminate called throwing an exceptionAbort trap: 6

About  SCDynamicStore :


Put the following line in the bin/nutch file :


based on