Stuff I do: January 2013

Thursday, 31 January 2013

How to find the process which has the file handle open ?

A lot of times you need to find which Process (PID) has the handle to a particular file.
In linux doing this easy using the following command:

$ ls -ld1 /proc/*/fd/* | grep <file-name-filter>

Enjoy !!

Monday, 28 January 2013

Bloom filters and Count min Sketching Data structures

Another Good read:
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

Original Post: http://lkozma.net/blog/sketching-data-structures/#comment-85892

“Sketching” data structures store a summary of a data set in situations where the whole data would be prohibitively costly to store (at least in a fast-access place like the memory as opposed to the hard disk). Variants of trees, hash tables, etc. are not sketching structures, they just facilitate access to the data, but they still store the data itself. However, the concept of hashing is closely related to most sketching ideas as we will see.

The main feature of sketching data structures is that they can answer certain questions about the data extremely efficiently, at the price of the occasional error. The best part is that the probability of an error can be quantified and the programmer can trade off the expected error rate with the amount of resources (storage, time) afforded. At the limit of this trade-off (when no error is allowed) these sketching structures collapse into traditional data structures.

Sketching data structures are somewhat counter-intuitive, but they can be useful in many real applications. I look at two such structures mostly for my own benefit: As I try to understand them, I write down my notes. Perhaps someone else will find them useful. Links to further information can be found in the end. Leave comments if you know of other sketching data structures that you found useful or if you have some favorite elegant and unusual data structure.

1. Bloom filter

Suppose we have to store a set of values (A) that come from a “universe” of possible values (U). Examples: IP addresses, words, names of people, etc. Then we need to check whether a new item xis a member of set A or not. For example, we might check if a word is spelled correctly by looking it up in the dictionary, or we can verify whether an IP address is banned by looking it up in our black list.

We could achieve this by storing the whole set A in our favorite data structure. Alternatively, we could just store a binary array, with one bit for each possible element in U. For example, to quickly check if a number is prime or not, we could precompute an array of bits for all numbers from 0 to the maximum value we need:

Prime = 001101010001010001010001...

To check whether a number is prime, we look at the corresponding bit and we are done. This is a dummy example, but it is already obvious that in most cases the range of possible values is too large to make this practical. The number of all possible strings of length 5, containing just letters from the English alphabet is 26^5 = 11,881,376 and in most real problems the universe U is much larger than that.

The magic of the Bloom filter allows us to get away with much less storage at the price of an occasional mistake. This mistake can only be a false positive, the Bloom filter might say that x is in A when in fact it is not. On the other hand, when it says that x is not in A, this is always true, in other words false negatives are impossible. In some applications (like the spell-checker), this is acceptable if false positives are not too frequent. In other applications (like the IP blacklist), misses are more common and in the case of a hit, we can verify the answer by reading the actual data from the more costly storage. In this case the Bloom filter can act as an efficiency layer in front of a more costly storage structure. If false positives can be tolerated, the Bloom filter can be used by itself.

The way it works is really simple: we use a binary array of size n, as in the prime numbers example, that is initialized with 0s. In this case however, n is much smaller than the total number of elements in U. For each element to be added to A, we compute k different hash values (using k independent hash functions) with results between 1 and n. We set all these locations h1, h2, …, hk (the indexes returned by the hash functions) in the binary array to 1. To check if y is in A, we compute the hash values h1(y), …, hk(y) and check the corresponding locations in the array. If at least one of them is 0, the element is missing. If all fields are 1, we can say that the element is present with a certainty that depends on n (the size of the array), k (the number of hashes) and the number of elements inserted. Note that n and k can be set beforehand by the programmer.

The source of this uncertainty is that hash values can collide. This becomes more of a problem as the array is filling up. If the array were full, the answer to all queries would be a yes. In this simple variant, deleting an element is not possible: we cannot just set the corresponding fields to 0, as this might interfere with other elements that were stored. There are many variants of Bloom filters, some allowing deletion and some allowing the storage of a few bits of data as well. For these and for some rigorous analysis, as well as some implementation tricks, see the links below.

A quick dummy example is a name database. Suppose we want to store female names and reject male names. We use two hash functions that return a number from 1 to 10 for any string.

Initial configuration: 0000000000
Insert("Sally")      : 0100000001
   # h1("Sally") = 2, h2("Sally") = 10
Insert("Jane")       : 1110000001
   # h1("Jane") = 1, h2("Jane") = 3
Insert("Mary")       : 1110100001
   # h1("Mary") = 5, h2("Mary") = 2 [collision]

Query("Sally")
   # bits 2 and 10 are set,
   # return HIT
Query("John")
   # h1("John") = 10 set, but h2("John") = 4 not set
   # return MISS
Query("Bob")
   # h1("Bob") = 5 set, h2("Bob") = 1 set
   # return HIT (false positive)

Wikipedia: Bloom filter (and variants)

Description and Ruby implementation

Quick introduction and demo

JavaScript demo

Bloom filter – the math

The original paper (Bloom, 1970)

Bloom filter, detailed survey

Tech talk: Bloom filter and variants

Bloom filter in C# using literate programming

Bloom filter in Ruby

2. Count-Min sketch

The Count-Min (CM) sketch is less known than the Bloom filter, but it is somewhat similar (especially to the counting variants of the Bloom filter). The problem here is to store a numerical value associated with each element, say the number of occurrences of the element in a stream (for example when counting accesses from different IP addresses to a server). Surprisingly, this can be done using less space than the number of elements, with the trade-off that the result can be slightly off sometimes, but mostly on the small values. Again, the parameters of the data structure can be chosen such as to obtain a desired accuracy.

CM works as follows: we have k different hash functions and kdifferent tables which are indexed by the outputs of these functions (note that the Bloom filter can be implemented in this way as well). The fields in the tables are now integer values. Initially we have all fields set to 0 (all unseen elements have count 0). When we increase the count of an element, we increment all the corresponding k fields in the different tables (given by the hash values of the element). If a decrease operation is allowed (which makes things more difficult), we similarly subtract a value from all k elements.

To obtain the count of an element, we take the minimum of the kfields that correspond to that element (as given by the hashes). This makes intuitive sense. Out of the k values, probably some have been incremented on other elements also (if there were collisions on the hash values). However, if not all k fields have been returned by the hash functions on other elements, the minimum will give the correct value. See illustration for an example on counting hits from IP addresses:

In this example the scenario could be that we want to notice if an IP address is responsible for a lot of traffic (to further investigate if there is a problem or some kind of attack). The CM structure allows us to do this without storing a record for each address. When we increment the fields corresponding to an address, simultaneously we check if the minimum is above some threshold and we do some costly operation if it is (which might be a false alert). On the other hand, the real count can never be larger than the reported number, so if the minimum is a small number, we don’t have to do anything (this holds for the presented simple variant that does not allow decreases). As the example shows, CM sketch is most useful for detecting “heavy hitters” in a stream.

It is interesting to note that if we take the CM data structure and make the counters such that they saturate at 1, we obtain the Bloom filter.

For further study, analysis of the data structure and variants, proper choice of parameters, see the following links:

CM sketch page

Original paper (Cormode, Muthukrishnan, 2005)

Lecture slides (Helsinki Univ. of Tech)

Sketch library C++

C and Java implementations

What is your favorite counter-intuitive data structure or algorithm?

Stealing browser history

http://merchantos.com/makebeta/tools/spyjax

Thursday, 24 January 2013

SPOJ problem ID [4]: Transform the Expression

Following is my first solution to the SPOJ problem 4: http://www.spoj.com/problems/ONP/

Python solution:

import sys
import string

operators = ['+', '-', '*', '/', '^']

n = int(raw_input())

def isoperator(char):
if char in operators:
return True
else:
return False

def process_line(line):
stck = []
output = ""
for i in range(0,len(line)):
if line[i].isalpha():
output += line[i]
else:
if isoperator(line[i]):
while len(stck) >0 and stck[-1] in operators and operators.index(line[i]) < operators.index(stck[-1]): #lower precendence of new operator, so pop the higher precendence from stck
output+=stck.pop()
stck.append(line[i])
elif line[i] == ')': #is a closing brace
while stck[-1] != '(' :
output+=stck.pop()
stck.pop()
else : # opening brace, and put it to stck
stck.append(line[i])
#print "stck :" + str(stck)
#print "output :" + output
while len(stck) > 0:
output+=stck.pop()
print output

while n > 0:
line = raw_input()
n-=1
process_line(line)

Sunday, 20 January 2013

Learning Python and Django

Print python module path

>>> import os
>>> print os.__file__
/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/os.pyc

>>> import django
>>> print django.__file__
/Library/Python/2.7/site-packages/django/__init__.pyc

Creating my first Django project
-------------------------------------------
1. django-admin.py startproject mysite

This creates the following structure:

mysite/
    manage.py
    mysite/
        __init__.py
        settings.py
        urls.py
        wsgi.py

These files are:

The outer mysite/ directory is just a container for your project. Its name doesn't matter to Django; you can rename it to anything you like.
manage.py: A command-line utility that lets you interact with this Django project in various ways. You can read all the details about manage.py in django-admin.py and manage.py.
The inner mysite/ directory is the actual Python package for your project. Its name is the Python package name you'll need to use to import anything inside it (e.g. import mysite.settings).
mysite/__init__.py: An empty file that tells Python that this directory should be considered a Python package. (Read more about packages in the official Python docs if you're a Python beginner.)
mysite/settings.py: Settings/configuration for this Django project. Django settings will tell you all about how settings work.
mysite/urls.py: The URL declarations for this Django project; a "table of contents" of your Django-powered site. You can read more about URLs in URL dispatcher.
mysite/wsgi.py: An entry-point for WSGI-compatible webservers to serve your project. See How to deploy with WSGI for more details.

The development server

Let's verify this worked. Change into the outer mysite directory, if you haven't already, and run the commandpython manage.py runserver.

2. Creating an app for polls under this project

python manage.py startapp polls

That'll create a directory polls, which is laid out like this:

polls/
    __init__.py
    models.py
    tests.py
    views.py

This directory structure will house the poll application.

Edit the polls/models.py file.

But first we need to tell our project that the polls app is installed.

Edit the settings.py file again, and change the INSTALLED_APPS setting to include the string 'polls'.

Now Django knows to include the polls app. Let's run another command:

python manage.py sql polls

The sql command doesn't actually run the SQL in your database - it just prints it to the screen so that you can see what SQL Django thinks is required. If you wanted to, you could copy and paste this SQL into your database prompt. However, as we will see shortly, Django provides an easier way of committing the SQL to the database.



If you're interested, also run the following commands:
python manage.py validate -- Checks for any errors in the construction of your models.
python manage.py sqlcustom polls -- Outputs any custom SQL statements (such as table modifications or constraints) that are defined for the application.
python manage.py sqlclear polls -- Outputs the necessary DROP TABLE statements for this app, according to which tables already exist in your database (if any).
python manage.py sqlindexes polls -- Outputs the CREATE INDEX statements for this app.
python manage.py sqlall polls -- A combination of all the SQL from the sql, sqlcustom, and sqlindexes commands.

Looking at the output of those commands can help you understand what's actually happening under the hood.

-

Questions:
-------------
What are the common practices used by python based websites to improve performance ?

Saturday, 19 January 2013

A closer look at Hadoop, Yahoo! developer tutorial

http://developer.yahoo.com/hadoop/tutorial/module4.html

CHAINING JOBS

You can easily chain jobs together in this fashion by writing multiple driver methods, one for each job. Call the first driver method, which uses JobClient.runJob() to run the job and wait for it to complete. When that job has completed, then call the next driver method, which creates a new JobConf object referring to different instances of Mapper and Reducer, etc. The first job in the chain should write its output to a path which is then used as the input path for the second job. This process can be repeated for as many jobs are necessary to arrive at a complete solution to the problem.

Hadoop provides another mechanism for managing batches of jobs with dependencies between jobs. Rather than submit a JobConf to the JobClient's runJob() or submitJob() methods,org.apache.hadoop.mapred.jobcontrol.Job objects can be created to represent each job; A Jobtakes a JobConf object as its constructor argument. Jobs can depend on one another through the use of theaddDependingJob() method.

The JobControl interface allows you to query it to retrieve the state of individual jobs, as well as the list of jobs waiting, ready, running, and finished. The job submission process does not begin until the run() method of the JobControl object is called.

DEBUGGING MAPREDUCE

Hadoop keeps logs of important events during program execution. By default, these are stored in the logs/ subdirectory of the hadoop-version/ directory where you run Hadoop from. Log files are named hadoop-username-service-hostname.log. The most recent data is in the .log file; older logs have their date appended to them.

The service name refers to which of the several Hadoop programs are writing the log; these can be jobtracker, namenode, datanode, secondarynamenode, or tasktracker. All of these are important for debugging a whole Hadoop installation. But for individual programs, the tasktracker logs will be the most relevant. Any exceptions thrown by your program will be recorded in the tasktracker logs.

Debugging in the distributed setting is complicated and requires logging into several machines to access log data. If possible, programs should be unit tested by running Hadoop locally.

LISTING AND KILLING JOBS

$ bin/hadoop job -list

bin/hadoop job -kill jobid

HADOOP STREAMING

Whereas Pipes is an API that provides close coupling between C++ application code and Hadoop, Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations.

Hadoop Streaming allows you to use arbitrary programs for the Mapper and Reducer phases of a MapReduce job. Both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout.

Input and output are always represented textually in Streaming. The input (key, value) pairs are written to stdin for a Mapper or Reducer, with a 'tab' character separating the key from the value. The Streaming programs should split the input on the first tab character on the line to recover the key and the value. Streaming programs write their output to stdout in the same format: key \t value \n.

The inputs to the reducer are sorted so that while each line contains only a single (key, value) pair, all the values for the same key are adjacent to one another.

Provided it can handle its input in the text format described above, any Linux program or tool can be used as the mapper or reducer in Streaming. You can also write your own scripts in bash, python, perl, or another language of your choice, provided that the necessary interpreter is present on all nodes in your cluster.

The command as shown, with no arguments, will print some usage information. An example of how to run real commands is given below:

$ bin/hadoop jar contrib/streaming-hadoop-0.18.0-streaming.jar -mapper \
    myMapProgram -reducer myReduceProgram -input /some/dfs/path \
    -output /some/other/dfs/path

This assumes that myMapProgram and myReduceProgram are present on all nodes in the system ahead of time. If this is not the case, but they are present on the node launching the job, then they can be "shipped" to the other nodes with the -file option:

$ bin/hadoop jar contrib/streaming-hadoop-0.18.0-streaming.jar -mapper \
    myMapProgram -reducer myReduceProgram -file \
    myMapProgram -file myReduceProgram -input some/dfs/path \
    -output some/other/dfs/path

Any other support files necessary to run your program can be shipped in this manner as well.