Sunday, 3 February 2013

Difference between ClassNotFoundException and NoClassDefFoundError

http://stackoverflow.com/questions/1457863/what-is-the-difference-between-noclassdeffounderror-and-classnotfoundexception


The difference from the Java API Specifications is as follows.
Thrown when an application tries to load in a class through its string name using:
  • The forName method in class Class.
  • The findSystemClass method in class ClassLoader.
  • The loadClass method in class ClassLoader.
but no definition for the class with the specified name could be found.
Thrown if the Java Virtual Machine or a ClassLoader instance tries to load in the definition of a class (as part of a normal method call or as part of creating a new instance using the new expression) and no definition of the class could be found.
The searched-for class definition existed when the currently executing class was compiled, but the definition can no longer be found.
So, it appears that the NoClassDefFoundError occurs when the source was successfully compiled, but at runtime, the required class files were not found. This may be something that can happen in the distribution or production of JAR files, where not all the required class files were included.
As for ClassNotFoundException, it appears that it may stem from trying to make reflective calls to classes at runtime, but the classes the program is trying to call is does not exist.
The difference between the two is that one is an Error and the other is an Exception. WithNoClassDefFoundError is an Error and it arises from the Java Virtual Machine having problems finding a class it expected to find. A program that was expected to work at compile-time can't run because of class files not being found, or is not the same as was produced or encountered at compile-time. This is a pretty critical error, as the program cannot be initiated by the JVM.
On the other hand, the ClassNotFoundException is an Exception, so it is somewhat expected, and is something that is recoverable. Using reflection is can be error-prone (as there is some expectations that things may not go as expected. There is no compile-time check to see that all the required classes exist, so any problems with finding the desired classes will appear at runtime.



Thursday, 31 January 2013

How to find the process which has the file handle open ?

A lot of times you need to find which Process (PID) has the handle to a particular file.
In linux doing this easy using the following command:

$ ls -ld1 /proc/*/fd/* | grep <file-name-filter>

Enjoy !!

Monday, 28 January 2013

Bloom filters and Count min Sketching Data structures

Another Good read:
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

Original Post: http://lkozma.net/blog/sketching-data-structures/#comment-85892


“Sketching” data structures store a summary of a data set in situations where the whole data would be prohibitively costly to store (at least in a fast-access place like the memory as opposed to the hard disk). Variants of trees, hash tables, etc. are not sketching structures, they just facilitate access to the data, but they still store the data itself. However, the concept of hashing is closely related to most sketching ideas as we will see.
The main feature of sketching data structures is that they can answer certain questions about the data extremely efficiently, at the price of the occasional error. The best part is that the probability of an error can be quantified and the programmer can trade off the expected error rate with the amount of resources (storage, time) afforded. At the limit of this trade-off (when no error is allowed) these sketching structures collapse into traditional data structures.
Sketching data structures are somewhat counter-intuitive, but they can be useful in many real applications. I look at two such structures mostly for my own benefit: As I try to understand them, I write down my notes. Perhaps someone else will find them useful. Links to further information can be found in the end. Leave comments if you know of other sketching data structures that you found useful or if you have some favorite elegant and unusual data structure.

1. Bloom filter

Suppose we have to store a set of values (A) that come from a “universe” of possible values (U). Examples: IP addresses, words, names of people, etc. Then we need to check whether a new item xis a member of set A or not. For example, we might check if a word is spelled correctly by looking it up in the dictionary, or we can verify whether an IP address is banned by looking it up in our black list.
We could achieve this by storing the whole set in our favorite data structure. Alternatively, we could just store a binary array, with one bit for each possible element in U. For example, to quickly check if a number is prime or not, we could precompute an array of bits for all numbers from 0 to the maximum value we need:
Prime = 001101010001010001010001...
To check whether a number is prime, we look at the corresponding bit and we are done. This is a dummy example, but it is already obvious that in most cases the range of possible values is too large to make this practical. The number of all possible strings of length 5, containing just letters from the English alphabet is 26^5 = 11,881,376 and in most real problems the universe U is much larger than that.
The magic of the Bloom filter allows us to get away with much less storage at the price of an occasional mistake. This mistake can only be a false positive, the Bloom filter might say that x is in A when in fact it is not. On the other hand, when it says that x is not in A, this is always true, in other words false negatives are impossible. In some applications (like the spell-checker), this is acceptable if false positives are not too frequent. In other applications (like the IP blacklist), misses are more common and in the case of a hit, we can verify the answer by reading the actual data from the more costly storage. In this case the Bloom filter can act as an efficiency layer in front of a more costly storage structure. If false positives can be tolerated, the Bloom filter can be used by itself.
The way it works is really simple: we use a binary array of size n, as in the prime numbers example, that is initialized with 0s. In this case however, n is much smaller than the total number of elements in U. For each element to be added to A, we compute different hash values (using k independent hash functions) with results between 1 and n. We set all these locations h1, h2, …, hk (the indexes returned by the hash functions) in the binary array to 1. To check if y is in A, we compute the hash values h1(y), …, hk(y) and check the corresponding locations in the array. If at least one of them is 0, the element is missing. If all fields are 1, we can say that the element is present with a certainty that depends on n (the size of the array), k (the number of hashes) and the number of elements inserted. Note that n and k can be set beforehand by the programmer.
The source of this uncertainty is that hash values can collide. This becomes more of a problem as the array is filling up. If the array were full, the answer to all queries would be a yes. In this simple variant, deleting an element is not possible: we cannot just set the corresponding fields to 0, as this might interfere with other elements that were stored. There are many variants of Bloom filters, some allowing deletion and some allowing the storage of a few bits of data as well. For these and for some rigorous analysis, as well as some implementation tricks, see the links below.
A quick dummy example is a name database. Suppose we want to store female names and reject male names. We use two hash functions that return a number from 1 to 10 for any string.
Initial configuration: 0000000000
Insert("Sally")      : 0100000001
   # h1("Sally") = 2, h2("Sally") = 10
Insert("Jane")       : 1110000001
   # h1("Jane") = 1, h2("Jane") = 3
Insert("Mary")       : 1110100001
   # h1("Mary") = 5, h2("Mary") = 2 [collision]

Query("Sally")
   # bits 2 and 10 are set,
   # return HIT
Query("John")
   # h1("John") = 10 set, but h2("John") = 4 not set
   # return MISS
Query("Bob")
   # h1("Bob") = 5 set, h2("Bob") = 1 set
   # return HIT (false positive)
The original paper (Bloom, 1970)

2. Count-Min sketch

The Count-Min (CM) sketch is less known than the Bloom filter, but it is somewhat similar (especially to the counting variants of the Bloom filter). The problem here is to store a numerical value associated with each element, say the number of occurrences of the element in a stream (for example when counting accesses from different IP addresses to a server). Surprisingly, this can be done using less space than the number of elements, with the trade-off that the result can be slightly off sometimes, but mostly on the small values. Again, the parameters of the data structure can be chosen such as to obtain a desired accuracy.
CM works as follows: we have k different hash functions and kdifferent tables which are indexed by the outputs of these functions (note that the Bloom filter can be implemented in this way as well). The fields in the tables are now integer values. Initially we have all fields set to 0 (all unseen elements have count 0). When we increase the count of an element, we increment all the corresponding k fields in the different tables (given by the hash values of the element). If a decrease operation is allowed (which makes things more difficult), we similarly subtract a value from all k elements.
To obtain the count of an element, we take the minimum of the kfields that correspond to that element (as given by the hashes). This makes intuitive sense. Out of the k values, probably some have been incremented on other elements also (if there were collisions on the hash values). However, if not all k fields have been returned by the hash functions on other elements, the minimum will give the correct value. See illustration for an example on counting hits from IP addresses:
In this example the scenario could be that we want to notice if an IP address is responsible for a lot of traffic (to further investigate if there is a problem or some kind of attack). The CM structure allows us to do this without storing a record for each address. When we increment the fields corresponding to an address, simultaneously we check if the minimum is above some threshold and we do some costly operation if it is (which might be a false alert). On the other hand, the real count can never be larger than the reported number, so if the minimum is a small number, we don’t have to do anything (this holds for the presented simple variant that does not allow decreases). As the example shows, CM sketch is most useful for detecting “heavy hitters” in a stream.
It is interesting to note that if we take the CM data structure and make the counters such that they saturate at 1, we obtain the Bloom filter.
For further study, analysis of the data structure and variants, proper choice of parameters, see the following links:
Original paper (Cormode, Muthukrishnan, 2005)
Lecture slides (Helsinki Univ. of Tech)
What is your favorite counter-intuitive data structure or algorithm?

Stealing browser history

Thursday, 24 January 2013

SPOJ problem ID [4]: Transform the Expression

Following is my first solution to the SPOJ problem 4: http://www.spoj.com/problems/ONP/

Python solution:


import sys
import string

operators = ['+', '-', '*', '/', '^']

n = int(raw_input())

def isoperator(char):
        if char in operators:
                return True
        else:
                return False

def process_line(line):
        stck = []
        output = ""
        for i in range(0,len(line)):
                if line[i].isalpha():
                        output += line[i]
                else:
                        if isoperator(line[i]):
                                while len(stck) >0 and stck[-1] in operators and operators.index(line[i]) < operators.index(stck[-1]): #lower precendence of new operator, so pop the higher precendence from stck
                                        output+=stck.pop()
                                stck.append(line[i])
                        elif line[i] == ')': #is a closing brace
                                while stck[-1] != '(' :
                                        output+=stck.pop()
                                stck.pop()
                        else : # opening brace, and put it to stck
                                stck.append(line[i])
                #print "stck     :" + str(stck)
                #print "output   :" + output
        while len(stck) > 0:
                output+=stck.pop()
        print output

while n > 0:
        line = raw_input()
        n-=1
        process_line(line)

Sunday, 20 January 2013

Learning Python and Django

Print python module path


>>> import os
>>> print os.__file__
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/os.pyc


>>> import django
>>> print django.__file__
/Library/Python/2.7/site-packages/django/__init__.pyc

Creating my first Django project
-------------------------------------------
1. django-admin.py startproject mysite


This creates the following structure:

mysite/
    manage.py
    mysite/
        __init__.py
        settings.py
        urls.py
        wsgi.py


These files are:
  • The outer mysite/ directory is just a container for your project. Its name doesn't matter to Django; you can rename it to anything you like.
  • manage.py: A command-line utility that lets you interact with this Django project in various ways. You can read all the details about manage.py in django-admin.py and manage.py.
  • The inner mysite/ directory is the actual Python package for your project. Its name is the Python package name you'll need to use to import anything inside it (e.g. import mysite.settings).
  • mysite/__init__.py: An empty file that tells Python that this directory should be considered a Python package. (Read more about packages in the official Python docs if you're a Python beginner.)
  • mysite/settings.py: Settings/configuration for this Django project. Django settings will tell you all about how settings work.
  • mysite/urls.py: The URL declarations for this Django project; a "table of contents" of your Django-powered site. You can read more about URLs in URL dispatcher.
  • mysite/wsgi.py: An entry-point for WSGI-compatible webservers to serve your project. See How to deploy with WSGI for more details.


The development server

Let's verify this worked. Change into the outer mysite directory, if you haven't already, and run the commandpython manage.py runserver


2. Creating an app for polls under this project


python manage.py startapp polls



That'll create a directory polls, which is laid out like this:
polls/
    __init__.py
    models.py
    tests.py
    views.py
This directory structure will house the poll application.
Edit the polls/models.py file.

But first we need to tell our project that the polls app is installed.
Edit the settings.py file again, and change the INSTALLED_APPS setting to include the string 'polls'
Now Django knows to include the polls app. Let's run another command:
python manage.py sql polls
The sql command doesn't actually run the SQL in your database - it just prints it to the screen so that you can see what SQL Django thinks is required. If you wanted to, you could copy and paste this SQL into your database prompt. However, as we will see shortly, Django provides an easier way of committing the SQL to the database.

If you're interested, also run the following commands:
Looking at the output of those commands can help you understand what's actually happening under the hood.






-

Questions:
-------------
What are the common practices used by python based websites to improve performance ?

Saturday, 19 January 2013

A closer look at Hadoop, Yahoo! developer tutorial

http://developer.yahoo.com/hadoop/tutorial/module4.html


CHAINING JOBS

You can easily chain jobs together in this fashion by writing multiple driver methods, one for each job. Call the first driver method, which uses JobClient.runJob() to run the job and wait for it to complete. When that job has completed, then call the next driver method, which creates a new JobConf object referring to different instances of Mapper and Reducer, etc. The first job in the chain should write its output to a path which is then used as the input path for the second job. This process can be repeated for as many jobs are necessary to arrive at a complete solution to the problem.

Hadoop provides another mechanism for managing batches of jobs with dependencies between jobs. Rather than submit a JobConf to the JobClient's runJob() or submitJob() methods,org.apache.hadoop.mapred.jobcontrol.Job objects can be created to represent each job; A Jobtakes a JobConf object as its constructor argument. Jobs can depend on one another through the use of theaddDependingJob() method. 

The JobControl interface allows you to query it to retrieve the state of individual jobs, as well as the list of jobs waiting, ready, running, and finished. The job submission process does not begin until the run() method of the JobControl object is called.

DEBUGGING MAPREDUCE

 Hadoop keeps logs of important events during program execution. By default, these are stored in the logs/ subdirectory of the hadoop-version/ directory where you run Hadoop from. Log files are named hadoop-username-service-hostname.log. The most recent data is in the .log file; older logs have their date appended to them.

The service name refers to which of the several Hadoop programs are writing the log; these can be jobtracker, namenode, datanode, secondarynamenode, or tasktracker. All of these are important for debugging a whole Hadoop installation. But for individual programs, the tasktracker logs will be the most relevant. Any exceptions thrown by your program will be recorded in the tasktracker logs.

Debugging in the distributed setting is complicated and requires logging into several machines to access log data. If possible, programs should be unit tested by running Hadoop locally. 

LISTING AND KILLING JOBS



$ bin/hadoop job -list
bin/hadoop job -kill jobid

HADOOP STREAMING

Whereas Pipes is an API that provides close coupling between C++ application code and Hadoop, Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations.
Hadoop Streaming allows you to use arbitrary programs for the Mapper and Reducer phases of a MapReduce job. Both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout.
Input and output are always represented textually in Streaming. The input (key, value) pairs are written to stdin for a Mapper or Reducer, with a 'tab' character separating the key from the value. The Streaming programs should split the input on the first tab character on the line to recover the key and the value. Streaming programs write their output to stdout in the same format: key \t value \n.
The inputs to the reducer are sorted so that while each line contains only a single (key, value) pair, all the values for the same key are adjacent to one another.
Provided it can handle its input in the text format described above, any Linux program or tool can be used as the mapper or reducer in Streaming. You can also write your own scripts in bash, python, perl, or another language of your choice, provided that the necessary interpreter is present on all nodes in your cluster.

The command as shown, with no arguments, will print some usage information. An example of how to run real commands is given below:
$ bin/hadoop jar contrib/streaming-hadoop-0.18.0-streaming.jar -mapper \
    myMapProgram -reducer myReduceProgram -input /some/dfs/path \
    -output /some/other/dfs/path
This assumes that myMapProgram and myReduceProgram are present on all nodes in the system ahead of time. If this is not the case, but they are present on the node launching the job, then they can be "shipped" to the other nodes with the -file option:
$ bin/hadoop jar contrib/streaming-hadoop-0.18.0-streaming.jar -mapper \
    myMapProgram -reducer myReduceProgram -file \
    myMapProgram -file myReduceProgram -input some/dfs/path \
    -output some/other/dfs/path
Any other support files necessary to run your program can be shipped in this manner as well.



Monday, 31 December 2012

Adding GC logs to hadoop child processes and analyzing the GC logs

Add the following to the mapred-site.xml file:

mapred.child.java.opts Java opts for the task tracker child processes. The following symbol, if present, will be interpolated: @taskid@ is replaced by current TaskID. Any other occurrences of '@' will go unchanged. For example, to enable verbose gc logging to a file named for the taskid in /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@taskid@.gc.

Additional options: -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC

Reference: http://hadoop.apache.org/docs/r1.0.0/mapred-default.html


Analyzing GC logs:
-------------------------

Meaning of the [GC [PSYoungGen: 230400K->19135K(268800K)] line is:

  • Around 256MB (268800K) is the Young Generation Size, 
  • Before Garbage Collection in young generation the heap utilization in Young Generation area was around 255MB (230400K) and 
  • After garbage collection it reduced up to 18MB (19135K).


Reference: http://middlewaremagic.com/weblogic/?p=5131
https://www.ibm.com/developerworks/community/groups/service/html/communityview?communityUuid=22d56091-3a7b-4497-b36e-634b51838e11

Sunday, 30 December 2012

Playing around with Solr

The following article shows how to add a custom field (for ISBN no.s) to be indexed by Solr:

http://robotlibrarian.billdueber.com/solr-field-type-for-numericish-ids/

Would love to write a custom one myself soon !

Thursday, 27 December 2012

Inspiring article on Big Data

http://gigaom.com/data/how-disney-built-a-big-data-platform-on-a-startup-budget/

"By the sheer power of its will (and ingenuity), a small team has been able to craft a large custom platform out of Hadoop, NoSQL databases and other open-source technologies. "

Saturday, 22 December 2012

How to download the historical versions of pages from google cache

Well you can get the historical versions of web pages from the google cache. I am not sure how old are these, if any one of you knows and cares to look it up please let me know too :P ?

The query you need to use to get an older version of webpage www.mypage.com from Google Cache is as follows:
http://webcache.googleusercontent.com/search?q=cache:www.mypage.com

To download the page using the wget linux command, you need to pass user-agent as google doesn't allow wget to fetch data, so here is how you can automate it:

wget --output-document=out.htm --user-agent=AGENT --level=1 http://webcache.googleusercontent.com/search?q=cache:www.mypage.com

Interesting isn't it :)

Tuesday, 18 December 2012

Using awk to find the sum of a CSV file after group by on a particular column

http://www.theunixschool.com/2012/06/awk-10-examples-to-group-data-in-csv-or.html

I have a CSV file like :


65523 , 100
65522 , 900
65522 , 1800
65522 , 100
65522 , 100
65521 , 500
65521 , 200
65521 , 200

I need to find the sum of the 2nd column based on the grouping by the 1st column, so that the output looks something like:


65523 , 100
65522 , 2900
65521 , 900

SOLUTION:

This can be easily achieved using a single line awk script:

awk -F"," '{a[$1]+=$2;}END{for (i in a)print i, a[i];}' file

Awesome isn't it !! :)




Sorting a CSV file based on a particular column list

http://stackoverflow.com/questions/9471101/sort-csv-file-by-column-priority-using-the-sort-command-unix

sort --field-separator=';' --key=2,1
sort -nr -t',' -k3


To sort based on multiple columns use the syntax:

sort --key=1,1 --key=2,2r --key=3,3 --key=4,4r
sort -k1,1 -k2,2r -k3,3 -k4,4r
as in the following transcript:
pax$ echo '5 3 2 9
3 4 1 7
5 2 3 1
6 1 3 6
1 2 4 5
3 1 2 3
5 2 2 3' | sort --key=1,1 --key=2,2r --key=3,3 --key=4,4r

1 2 4 5
3 4 1 7
3 1 2 3
5 3 2 9
5 2 2 3
5 2 3 1
6 1 3 6
Remember to provide the -n option if you want them treated as proper numbers (variable length), such as:
sort -n -k1,1 -k2,2r -k3,3 -k4,4r


Learning to use Screen command in Unix



If your local computer crashes, or you are connected via a modem and lose the connection, the processes or login sessions you establish through screen don't go away. You can resume your screen sessions with the following command: screen -r

screen                                ==>    Start a new screen
(Ctrl+A) & C                    ==>    Start a new screen sub-window
(Ctrl+A) & K                    ==>    Kill the current sub-window
(Ctrl+A) & (Shift + ")       ==>    Show the list of screens running on the system
screen -r                            ==>    restore to the old screens
screen -ls                           ==>    list of running screens
(Ctrl+A) & (Shift + A)      ==>   rename the current screen

Sunday, 16 December 2012

Setting up SVN on AWS EC2 instance

Steps to setup the SVN repository on the cloud instance (http://www.ange-agostini.com/blog/it/5-minutes-to-set-up-a-subversion-server-in-the-cloud.html):
Install subversion, apache and mod_dav_svn:
# sudo yum install mod_dav_svn subversion
Edit the Apache configuration file for subversion:
# sudo vi /etc/httpd/conf.d/subversion.conf
Replace subversion.conf content by:

LoadModule dav_svn_module     modules/mod_dav_svn.so
LoadModule authz_svn_module   modules/mod_authz_svn.so
<Location /repos>
   DAV svn
   SVNParentPath /var/www/svn
   # Limit write permission to list of valid users.
   AuthType Basic
   AuthName "Authorization Realm"
   AuthUserFile /var/www/svn-auth/passwd
   AuthzSVNAccessFile  /var/www/svn-auth/access
   Require valid-user
</Location>

Create the directory which will contain the subversion repository:
# sudo mkdir /var/www/svn
Create the directory which will contain the permissions files.
 # sudo mkdir /var/www/svn-auth
Create the permission file:
# sudo vi /var/www/svn-auth/access
[/]
<theUser> = rw
Note: Replace <theUser> by the login you want to use to access your repository.
<theUser> will have read write access to all repositories.
It is possible to setup authorization by group or user for each repository.
Create the password file:
# sudo htpasswd -cb /var/www/svn-auth/passwd <theUser> <thePassword>
Note: Replace <theUser> by the login you want to use to access your repository. Replace <thePassword> by the password you want.
Create a repository (here project1):
# cd /var/www/svn
# sudo svnadmin create project1
Change files authorization:
# sudo chown -R apache.apache /var/www/svn /var/wwws/vn-auth
# chmod 600 /var/wwws/vn-auth/access /var/www/svn-auth/passwd
Start apache web server:
# sudo service httpd start
Note: to restart server use # sudo service httpd restart

Test subversion

Now subversion and apache should work.
Open a web browser and point to the URL : http://<Public DNS of your EC2 instance>/repos/project1
You should be prompted for your credential (Enter <theUser> <thePassword>) before accessing the repository

Subversion client

We are now going to interact with our repository from a windows PC.
If you don’t have a subversion client installed on your PC then you can install one from http://www.sliksvn.com/en/download .
You can test your subversion client from your PC by listing files on your repository:
svn ls http://<Public DNS of your EC2 instance>/repos/project1
The first time we often want to import some files to the repository:
svn import -m "Initial import." <path of the reposity where are the files on your PC> http://<Public DNS of your EC2 instance>/repos/project1


---------------------------------------------------------------------------------------------------------------------------------
ANOTHER WAY:

Steps given at https://forums.aws.amazon.com/thread.jspa?messageID=209468:

I signed up for an EC2 free tier tonight intending to use it as a Subversion server. The whole process took a couple of hours to setup, but can be compressed down to just a few minutes. If anybody else is looking to do the same, here is what I did:

1) Create the Linux-flavored micro instance.
2) Give it a Security Group that opens port 3690 to the sources of your choice. The following example allows SVN access from all Internet sources:
  • | tcp | 3690 | 3690 | 0.0.0.0/0
  • | udp | 3690 | 3690 | 0.0.0.0/0
3) SSH into the micro instance and run the command "sudo yum install subversion".
4) Create the directory to be used for subversion, for example "/home/ec2-user/svn".
5) Enter "svnadmin create /home/ec2-user/svn" (or whichever path you specified).
6) Edit two files: svnserve.conf and passwd both found in the "conf" directory of your newly created repository source.
In svnserve.conf, unremark the lines for "anon-access" (but change from "read" to "none") and "auth-access". Additionally, unremark the "password-db = passwd" line, but don't change it, and unremark the "realm =" line, providing a realm name of your choice.
In the passwd file, simply add a line entry with the user name(s) and password(s) to be used for access.
7) Finally, run "svnserve -d -r /home/ec2-user/svn" (or again, whichever path you specified) to kick off the instance.

Your Subversion server should now be available at the URL: svn://<yourinstancehostorip>/

Keep in mind the security for this quick and dirty setup is very minimal, and the daemon has not been configured for automatic startup at this point.

Reading:

http://jonathanhui.com/install-configure-subversion-ec2-amazon-linux

Saturday, 15 December 2012

Setting up Nutch on MAC - issues [Solved]

I setup nutch following the official tutorial : http://wiki.apache.org/nutch/NutchTutorial

However, I am facing this error when i try to run the command "bin/nutch crawl urls -dir crawl -depth 3 -topN 5" :


2012-12-15 14:43:13.028 java[82636:c07] *** Terminating app due to uncaught exception 'JavaNativeException', reason: 'KrbException: Could not load configuration from SCDynamicStore'
*** First throw call stack:
(
0   CoreFoundation                      0x00007fff8b94cfc6 __exceptionPreprocess + 198
1   libobjc.A.dylib                     0x00007fff87621d5e objc_exception_throw + 43
2   CoreFoundation                      0x00007fff8b9d72a9 -[NSException raise] + 9
3   JavaNativeFoundation                0x0000000108aa1c3f JNFCallStaticVoidMethod + 213
4   libjava.jnilib                      0x0000000108ac1169 Java_sun_security_krb5_SCDynamicStoreConfig_installNotificationCallback + 450
5   JavaNativeFoundation                0x0000000108aa4182 JNFPerformEnvBlock + 86
6   SystemConfiguration                 0x00007fff8703d3b8 rlsPerform + 119
7   CoreFoundation                      0x00007fff8b8bb6e1 __CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ + 17
8   CoreFoundation                      0x00007fff8b8baf4d __CFRunLoopDoSources0 + 253
9   CoreFoundation                      0x00007fff8b8e1d39 __CFRunLoopRun + 905
10  CoreFoundation                      0x00007fff8b8e1676 CFRunLoopRunSpecific + 230
11  java                                0x00000001081c0843 java + 18499
12  java                                0x00000001081c029a java + 17050
13  java                                0x00000001081bda98 java + 6808
)
terminate called throwing an exceptionAbort trap: 6

About  SCDynamicStore : http://developer.apple.com/library/mac/#documentation/Networking/Reference/SCDynamicStore/Reference/reference.html

SOLUTION:

Put the following line in the bin/nutch file :

NUTCH_OPTS="$NUTCH_OPTS -Djava.security.krb5.realm= -Djava.security.krb5.kdc="

based on https://issues.apache.org/jira/browse/HADOOP-7489