Monday 31 December 2012

Adding GC logs to Hadoop child processes and analyzing the GC logs

Add the following to the mapred-site.xml file:

Property name: mapred.child.java.opts

Description: Java opts for the task tracker child processes. The following symbol, if present, will be interpolated: @taskid@ is replaced by the current TaskID. Any other occurrences of '@' will go unchanged. For example, to enable verbose GC logging to a file named for the taskid in /tmp and to set the heap maximum to a gigabyte, pass a 'value' of:

-Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc

Additional options: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC

Reference: http://hadoop.apache.org/docs/r1.0.0/mapred-default.html


Analyzing GC logs:
-------------------------

The line [GC [PSYoungGen: 230400K->19135K(268800K)] reads as follows:

  • The Young Generation capacity is about 262 MB (268800K).
  • Before the garbage collection, Young Generation usage was 225 MB (230400K).
  • After the garbage collection, it dropped to about 19 MB (19135K).
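
To check these figures yourself, here is a small sketch that pulls the three K values out of a PSYoungGen entry and divides each by 1024. The function name gc_young_mb is made up for this post, and the pattern only covers the short form of the line shown above, so treat it as a starting point rather than a full GC log parser.

```shell
# Sample line taken from the post; a real log line carries more fields after it.
line='[GC [PSYoungGen: 230400K->19135K(268800K)]'

# gc_young_mb (hypothetical helper): extract before/after/capacity in K
# from a PSYoungGen entry and print them in MB (1 MB = 1024 K).
gc_young_mb() {
  printf '%s\n' "$1" |
    sed -n 's/.*PSYoungGen: \([0-9]*\)K->\([0-9]*\)K(\([0-9]*\)K).*/\1 \2 \3/p' |
    awk '{ printf "before=%.1fMB after=%.1fMB capacity=%.1fMB\n", $1/1024, $2/1024, $3/1024 }'
}

gc_young_mb "$line"
```

For the sample line this prints before=225.0MB after=18.7MB capacity=262.5MB.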


Reference: http://middlewaremagic.com/weblogic/?p=5131
https://www.ibm.com/developerworks/community/groups/service/html/communityview?communityUuid=22d56091-3a7b-4497-b36e-634b51838e11

Sunday 30 December 2012

Playing around with Solr

The following article shows how to add a custom field type (for ISBN numbers) to be indexed by Solr:

http://robotlibrarian.billdueber.com/solr-field-type-for-numericish-ids/

Would love to write a custom one myself soon!

Thursday 27 December 2012

Inspiring article on Big Data

http://gigaom.com/data/how-disney-built-a-big-data-platform-on-a-startup-budget/

"By the sheer power of its will (and ingenuity), a small team has been able to craft a large custom platform out of Hadoop, NoSQL databases and other open-source technologies. "

Saturday 22 December 2012

How to download the historical versions of pages from Google cache

You can get older versions of web pages from the Google cache. I am not sure how far back these copies go; if any of you knows, or cares to look it up, please let me know too :P

The query you need to use to get the cached version of the page www.mypage.com from Google Cache is as follows:
http://webcache.googleusercontent.com/search?q=cache:www.mypage.com

To download the page with the wget Linux command, you need to pass a user agent, because Google rejects requests that identify themselves as wget. Replace AGENT with a browser user-agent string, and quote the URL so the shell does not try to interpret the '?':

wget --output-document=out.htm --user-agent=AGENT --level=1 'http://webcache.googleusercontent.com/search?q=cache:www.mypage.com'

Interesting, isn't it? :)

Tuesday 18 December 2012

Using awk to find the sum of a CSV file after group by on a particular column

http://www.theunixschool.com/2012/06/awk-10-examples-to-group-data-in-csv-or.html

I have a CSV file like:


65523 , 100
65522 , 900
65522 , 1800
65522 , 100
65522 , 100
65521 , 500
65521 , 200
65521 , 200

I need to find the sum of the 2nd column based on the grouping by the 1st column, so that the output looks something like:


65523 , 100
65522 , 2900
65521 , 900

SOLUTION:

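This can be easily achieved using a single-line awk script:

awk -F"," '{a[$1]+=$2} END{for (i in a) print i ", " a[i]}' file

Note that awk does not guarantee the order in which "for (i in a)" visits the keys, so pipe the result through sort if the order matters.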

Awesome isn't it !! :)
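
Here is a sketch you can paste into a terminal to try the group-by on the sample data above. It differs from the one-liner in two ways that are my own choices: the field separator also swallows the spaces around each comma, and the result is piped through sort so the order is stable. The function name group_sum is just for illustration.

```shell
# Sample data from the post, written to a temporary file.
csv=$(mktemp)
cat > "$csv" <<'EOF'
65523 , 100
65522 , 900
65522 , 1800
65522 , 100
65522 , 100
65521 , 500
65521 , 200
65521 , 200
EOF

# group_sum (hypothetical helper): sum column 2 per key in column 1.
# The "[ ]*,[ ]*" separator strips the spaces around each comma, and the
# final sort orders the keys numerically, highest first.
group_sum() {
  awk -F'[ ]*,[ ]*' '{ sum[$1] += $2 } END { for (k in sum) print k " , " sum[k] }' "$1" |
    sort -t',' -k1,1nr
}

group_sum "$csv"
rm -f "$csv"
```

On the sample data this prints the three totals in the same order as the expected output shown earlier.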




Sorting a CSV file based on a particular column list

http://stackoverflow.com/questions/9471101/sort-csv-file-by-column-priority-using-the-sort-command-unix

sort --field-separator=';' --key=2,1
sort -nr -t',' -k3


To sort based on multiple columns use the syntax:

sort --key=1,1 --key=2,2r --key=3,3 --key=4,4r
sort -k1,1 -k2,2r -k3,3 -k4,4r
as in the following transcript:
pax$ echo '5 3 2 9
3 4 1 7
5 2 3 1
6 1 3 6
1 2 4 5
3 1 2 3
5 2 2 3' | sort --key=1,1 --key=2,2r --key=3,3 --key=4,4r

1 2 4 5
3 4 1 7
3 1 2 3
5 3 2 9
5 2 2 3
5 2 3 1
6 1 3 6
Remember to provide the -n option if you want them treated as proper numbers (variable length), such as:
sort -n -k1,1 -k2,2r -k3,3 -k4,4r
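
To see what the numeric flag changes, here is a small sketch on comma-separated data. The rows and the function names sort_numeric and sort_text are made up for the demo; the flag can also be attached to a single key (-k2,2nr) instead of being given globally with -n.

```shell
# Three made-up rows: a text key, a number, and a filler column.
rows='b,10,x
a,9,y
a,10,z'

# Field 1 ascending as text, then field 2 descending as a number.
sort_numeric() { printf '%s\n' "$1" | LC_ALL=C sort -t',' -k1,1 -k2,2nr; }

# Same keys without the "n" flag: field 2 is compared character by character.
sort_text() { printf '%s\n' "$1" | LC_ALL=C sort -t',' -k1,1 -k2,2r; }

sort_numeric "$rows"
echo '--'
sort_text "$rows"
```

With the numeric flag, a,10,z comes before a,9,y. Without it, "9" sorts above "10" as text, so the two rows swap.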


Learning to use Screen command in Unix



If your local computer crashes, or you are connected via a modem and lose the connection, the processes or login sessions you establish through screen don't go away. You can resume your screen sessions with the following command: screen -r

screen                       ==>    Start a new screen session
Ctrl+A, then c               ==>    Create a new window in the session
Ctrl+A, then k               ==>    Kill the current window
Ctrl+A, then " (Shift + ')   ==>    Show the list of windows in the session
screen -r                    ==>    Reattach to a detached session
screen -ls                   ==>    List the running screen sessions
Ctrl+A, then A (Shift + a)   ==>    Rename the current window
Sunday 16 December 2012

Setting up SVN on AWS EC2 instance

Steps to set up the SVN repository on the cloud instance (http://www.ange-agostini.com/blog/it/5-minutes-to-set-up-a-subversion-server-in-the-cloud.html):
Install subversion, apache and mod_dav_svn:
# sudo yum install mod_dav_svn subversion
Edit the Apache configuration file for subversion:
# sudo vi /etc/httpd/conf.d/subversion.conf
Replace the content of subversion.conf with:

LoadModule dav_svn_module     modules/mod_dav_svn.so
LoadModule authz_svn_module   modules/mod_authz_svn.so
<Location /repos>
   DAV svn
   SVNParentPath /var/www/svn
   # Limit write permission to list of valid users.
   AuthType Basic
   AuthName "Authorization Realm"
   AuthUserFile /var/www/svn-auth/passwd
   AuthzSVNAccessFile  /var/www/svn-auth/access
   Require valid-user
</Location>

Create the directory which will contain the subversion repository:
# sudo mkdir /var/www/svn
Create the directory which will contain the permission files:
# sudo mkdir /var/www/svn-auth
Create the permission file:
# sudo vi /var/www/svn-auth/access
[/]
<theUser> = rw
Note: Replace <theUser> with the login you want to use to access your repository.
<theUser> will have read-write access to all repositories.
It is also possible to set up authorization by group or by user for each repository.
Create the password file:
# sudo htpasswd -cb /var/www/svn-auth/passwd <theUser> <thePassword>
Note: Replace <theUser> with the login you want to use to access your repository. Replace <thePassword> with the password you want.
Create a repository (here project1):
# cd /var/www/svn
# sudo svnadmin create project1
Change file ownership and permissions:
# sudo chown -R apache:apache /var/www/svn /var/www/svn-auth
# sudo chmod 600 /var/www/svn-auth/access /var/www/svn-auth/passwd
Start apache web server:
# sudo service httpd start
Note: to restart the server, use # sudo service httpd restart

Test subversion

Subversion and Apache should now be working.
Open a web browser and point it to the URL: http://<Public DNS of your EC2 instance>/repos/project1
You should be prompted for your credentials (enter <theUser> and <thePassword>) before accessing the repository.

Subversion client

We are now going to interact with our repository from a Windows PC.
If you don’t have a subversion client installed on your PC then you can install one from http://www.sliksvn.com/en/download .
You can test your subversion client from your PC by listing files on your repository:
svn ls http://<Public DNS of your EC2 instance>/repos/project1
The first time, we often want to import some files into the repository:
svn import -m "Initial import." <path of the directory on your PC that contains the files> http://<Public DNS of your EC2 instance>/repos/project1


---------------------------------------------------------------------------------------------------------------------------------
ANOTHER WAY:

Steps given at https://forums.aws.amazon.com/thread.jspa?messageID=209468:

I signed up for an EC2 free tier tonight intending to use it as a Subversion server. The whole process took a couple of hours to setup, but can be compressed down to just a few minutes. If anybody else is looking to do the same, here is what I did:

1) Create the Linux-flavored micro instance.
2) Give it a Security Group that opens port 3690 to the sources of your choice. The following example allows SVN access from all Internet sources:
  • | tcp | 3690 | 3690 | 0.0.0.0/0
  • | udp | 3690 | 3690 | 0.0.0.0/0
3) SSH into the micro instance and run the command "sudo yum install subversion".
4) Create the directory to be used for subversion, for example "/home/ec2-user/svn".
5) Enter "svnadmin create /home/ec2-user/svn" (or whichever path you specified).
6) Edit two files: svnserve.conf and passwd both found in the "conf" directory of your newly created repository source.
In svnserve.conf, unremark the lines for "anon-access" (but change from "read" to "none") and "auth-access". Additionally, unremark the "password-db = passwd" line, but don't change it, and unremark the "realm =" line, providing a realm name of your choice.
In the passwd file, simply add a line entry with the user name(s) and password(s) to be used for access.
7) Finally, run "svnserve -d -r /home/ec2-user/svn" (or again, whichever path you specified) to kick off the instance.

Your Subversion server should now be available at the URL: svn://<yourinstancehostorip>/

Keep in mind the security for this quick and dirty setup is very minimal, and the daemon has not been configured for automatic startup at this point.
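
For reference, step 6 can be sketched as a small script. Instead of uncommenting the lines by hand, it overwrites conf/svnserve.conf and conf/passwd with just the settings the steps mention, which is a shortcut of mine and not what the forum post does. The function name write_svnserve_conf, the realm, and the user/password pair are placeholders. The password is stored in plain text, in line with the "quick and dirty" warning above.

```shell
# write_svnserve_conf (hypothetical helper): write the two files edited in
# step 6. All four arguments are supplied by the caller.
write_svnserve_conf() {
  repo=$1
  realm=$2
  user=$3
  password=$4
  mkdir -p "$repo/conf"
  # No anonymous access, write access for authenticated users.
  cat > "$repo/conf/svnserve.conf" <<EOF
[general]
anon-access = none
auth-access = write
password-db = passwd
realm = $realm
EOF
  # One "name = password" line per user.
  cat > "$repo/conf/passwd" <<EOF
[users]
$user = $password
EOF
}

# Demo against a throwaway directory; point it at your real repository path instead.
demo_repo=$(mktemp -d)
write_svnserve_conf "$demo_repo" "My Realm" someuser somepassword
cat "$demo_repo/conf/svnserve.conf" "$demo_repo/conf/passwd"
rm -rf "$demo_repo"
```

After writing the files into the real repository, step 7 (svnserve -d -r <path>) picks them up when the daemon starts.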

Reading:

http://jonathanhui.com/install-configure-subversion-ec2-amazon-linux

Saturday 15 December 2012

Setting up Nutch on Mac - issues [Solved]

I set up Nutch following the official tutorial: http://wiki.apache.org/nutch/NutchTutorial

However, I hit this error when I tried to run the command "bin/nutch crawl urls -dir crawl -depth 3 -topN 5":


2012-12-15 14:43:13.028 java[82636:c07] *** Terminating app due to uncaught exception 'JavaNativeException', reason: 'KrbException: Could not load configuration from SCDynamicStore'
*** First throw call stack:
(
0   CoreFoundation                      0x00007fff8b94cfc6 __exceptionPreprocess + 198
1   libobjc.A.dylib                     0x00007fff87621d5e objc_exception_throw + 43
2   CoreFoundation                      0x00007fff8b9d72a9 -[NSException raise] + 9
3   JavaNativeFoundation                0x0000000108aa1c3f JNFCallStaticVoidMethod + 213
4   libjava.jnilib                      0x0000000108ac1169 Java_sun_security_krb5_SCDynamicStoreConfig_installNotificationCallback + 450
5   JavaNativeFoundation                0x0000000108aa4182 JNFPerformEnvBlock + 86
6   SystemConfiguration                 0x00007fff8703d3b8 rlsPerform + 119
7   CoreFoundation                      0x00007fff8b8bb6e1 __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ + 17
8   CoreFoundation                      0x00007fff8b8baf4d __CFRunLoopDoSources0 + 253
9   CoreFoundation                      0x00007fff8b8e1d39 __CFRunLoopRun + 905
10  CoreFoundation                      0x00007fff8b8e1676 CFRunLoopRunSpecific + 230
11  java                                0x00000001081c0843 java + 18499
12  java                                0x00000001081c029a java + 17050
13  java                                0x00000001081bda98 java + 6808
)
terminate called throwing an exceptionAbort trap: 6

About SCDynamicStore: http://developer.apple.com/library/mac/#documentation/Networking/Reference/SCDynamicStore/Reference/reference.html

SOLUTION:

Put the following line in the bin/nutch file:

NUTCH_OPTS="$NUTCH_OPTS -Djava.security.krb5.realm= -Djava.security.krb5.kdc="

based on https://issues.apache.org/jira/browse/HADOOP-7489