Monday, 22 April 2013

TopCoder SRM 571 - Div I Lvl 1

http://community.topcoder.com/stat?c=problem_statement&pm=12436&rd=15491
The following code prints the first 50 record names (the integers 1..n) in lexicographically sorted order:


class Test:
    def __init__(self):
        self.res = []

    def f(self, n, x):
        if x > n:
            return
        if len(self.res) < 50:
            self.res.append(x)
            #print x
        else:
            return
        self.f(n, x*10)
        if x % 10 != 9:
            self.f(n, x+1)

    def getRes(self, n, x):
        self.f(n, x)
        print self.res

Sample values printed from the program:
Test().getRes(1010,1)


1
10
100
1000
1001
1002
1003
:
1009
101
1010
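The recursion walks the names in lexicographic order: from x it first descends to x*10 (appending a '0' is the smallest lexicographic extension), then moves to x+1 (bumping the last digit), skipping the bump when the last digit is 9. A Python 3 friendly rendering of the same idea, with made-up sample calls:

```python
def lex_first(n, limit=50):
    """First `limit` values of 1..n in lexicographic (string) order."""
    res = []

    def walk(x):
        if x > n or len(res) >= limit:
            return
        res.append(x)
        walk(x * 10)        # appending '0': smallest lexicographic step
        if x % 10 != 9:
            walk(x + 1)     # then bump the last digit

    walk(1)
    return res

print(lex_first(1010, 14))
# [1, 10, 100, 1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 101]
```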


TopCoder SRM 572 - Div I Lvl 1

http://community.topcoder.com/stat?c=round_overview&er=5&rd=15492


def myfunc(s,k):
    res = 0
    incr = len(s)-k
    for i in range(0,incr):
        # characters at positions i, i+incr, i+2*incr, ... must all become equal
        j = i
        t = {}
        ctr = 0
        #print "i:",str(i)
        while j<len(s):
            #print "j:",str(j)
            if s[j] in t:
                t[s[j]] += 1
            else:
                t[s[j]] = 1
            j += incr
            ctr += 1
        mx = max(t.values())
        res += ctr - mx
   
    return res
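The function above computes, for each residue class modulo incr = len(s)-k, how many characters must change so the class becomes uniform, i.e. the minimum edits to make s periodic with period len(s)-k. A compact rendering of the same logic with a couple of made-up sanity checks:

```python
from collections import Counter

def min_changes(s, k):
    """Minimum character changes so that s[i] == s[i + (len(s)-k)] for all
    valid i (same logic as myfunc above, using a Counter per residue class)."""
    incr = len(s) - k
    res = 0
    for i in range(incr):
        counts = Counter(s[i::incr])   # one residue class mod incr
        res += sum(counts.values()) - max(counts.values())
    return res

print(min_changes("abcabcab", 5))  # 0: already periodic with period 3
print(min_changes("abcaXcab", 5))  # 1: change the X back to b
```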

Sunday, 21 April 2013

TopCoder SRM 573 - Div I Lvl 1

Div II Lvl 2 problem: easier version solution
http://community.topcoder.com/stat?c=problem_statement&pm=12471&rd=15493


def myfunc(s):
    our_strength = sum(s[0:3]) - min(s[0:3])  # sum of the two strongest of the first three
    rst = s[3:]
    #print our_strength
    if len(rst) == 0:
        return 1
    # find the rank of our team, start with rank 1 and increment it as you find better ranks
    res = 1
    rst.sort()
    #print rst
    found = True
    while found and rst:
        # take the strongest remaining team
        mx = rst.pop()
        # pair it with the weakest remaining team that together beats our strength
        found = False
        for i in rst:
            if i + mx > our_strength:
                rst.remove(i)
                found = True
                res += 1
                break
    return res

http://apps.topcoder.com/wiki/display/tc/SRM+573

TopCoder SRM 575 - DIV 1 Level 1

http://community.topcoder.com/stat?c=problem_statement&pm=12496&rd=15495

The solution for the smaller problem:


def myfunc(n):
    f = [0] * (n+1)
    for i in range(2, n+1):
        for j in range(2, i):
            if i % j == 0 and f[i-j] == 0:
                f[i] = 1
    print [(i, f[i]) for i in range(0, len(f))]

The solution for the bigger problem involves getting a pattern from the above solution.
http://apps.topcoder.com/wiki/display/tc/SRM+575

Tuesday, 2 April 2013

Using rsync to analyse hadoop cluster logs at the name node

If you have worked on a Hadoop cluster and tried going through the logs on all the different cluster nodes, you know how painful it can be. This script can be run on the name node, and it will copy all the logs for the given Hadoop job-id to the current directory of the name node. You will have to change the rsync parameters to suit your setup.

Python script:

import sys
import subprocess
import os

if len(sys.argv) < 2:
        print "Usage: python ",sys.argv[0]," <hadoop-job-id>"
        sys.exit(0)

nodes = ['192.168.112.117', '192.168.156.63', '192.168.152.31', '192.168.112.118', '192.168.156.65', '192.168.156.62' ]

#subprocess.Popen(["ls","-lr"])

for node in nodes:
        subprocess.Popen(["rsync", "-rav", "root@"+node+":/data/hadoop_logs/userlogs/"+sys.argv[1], "."])

#os.system("rsync -rav root@node-0de4bd:/data/hadoop_logs/userlogs/"+sys.argv[1]+" .")

IMP: rsync requires password-less SSH access to these nodes from the current node (the one on which you are running this script). For help on setting up password-less access check here.

Thursday, 21 March 2013

Print to dynamic files from AWK cmd

To print the output of AWK to a dynamic file:

head  236_236_raw |  awk -F';' '{  print $7,"\t",$6 > ("_"$2"_"$3".txt")}'

If the number of files open are large, you will get the following error:


awk: * makes too many open files
 input record number 203, file
 source line number 1

To overcome the above problem, you can close the files like this:

head  236_236_raw |  awk -F';' '{  print $7,"\t",$6 >> ("_"$2"_"$3".txt"); close("_"$2"_"$3".txt")}'

The important point to note is that the redirection operator also changed from ">" to ">>". Once close() has been called, a subsequent ">" would reopen the file and truncate it, losing the earlier output, whereas ">>" appends to it.

Tuesday, 19 March 2013

Getting started with Facebook App development on Heroku environment

I am currently working on setting up the work environment for Facebook App development on Heroku server using python as the development language.

The complete tutorial is at: https://devcenter.heroku.com/articles/facebook

On high level the steps involve:
  1. Creating a new Facebook app and checking the option for Heroku hosting.
  2. This creates a new domain on Heroku for your app and installs your app there.
  3. Now, you can 'git clone' the code on your local machine, make changes to the code and 'git push heroku master' to push changes back to the world.
  4. Need to setup the local environment for testing:
    1. This involves setting up the virtualenv for python. http://www.virtualenv.org/en/latest/
virtualenv is a tool to create isolated Python environments. The basic problem being addressed is one of dependencies and versions, and indirectly permissions. Imagine you have an application that needs version 1 of LibFoo, but another application requires version 2. How can you use both these applications? If you install everything into /usr/lib/python2.7/site-packages (or whatever your platform’s standard location is), it’s easy to end up in a situation where you unintentionally upgrade an application that shouldn’t be upgraded.

Or more generally, what if you want to install an application and leave it be? If an application works, any change in its libraries or the versions of those libraries can break the application. Also, what if you can’t install packages into the global site-packages directory? For instance, on a shared host.  

In all these cases, virtualenv can help you. It creates an environment that has its own installation directories, that doesn’t share libraries with other virtualenv environments (and optionally doesn’t access the globally installed libraries either).
-----------------------------
Using Virtualenv:
http://stackoverflow.com/questions/10763440/how-to-install-python3-version-of-package-via-pip

virtualenv -p /usr/bin/python3 py3env
source py3env/bin/activate
pip install package-name

---------------------------------

Now locally test your changes before deploying on the server.

Next steps:
- To be able to use Django here - https://devcenter.heroku.com/articles/django
- To be able to use a DB backend.

Monday, 18 March 2013

R, Getting Started

Download the R-package from :
http://www.r-project.org/

Download the R-studio IDE which has awesome features like auto-completion feature:

http://www.rstudio.com/

Installing a package in R (installed entropy package):
http://math.usask.ca/~longhai/software/installrpkg.html

http://math.usask.ca/~longhai/doc/others/R-tutorial.pdf

R Examples:

http://www.rexamples.com/

http://www.mayin.org/ajayshah/KB/R/index.html


-------------------------- R code to read a file into R variables -------------------

# Goal: To read in a simple data file, and look around its contents.

# Suppose you have a file "x.data" which looks like this:
#        1997,3.1,4
#        1998,7.2,19
#        1999,1.7,2
#        2000,1.1,13
# To read it in --

A <- read.table("x.data", sep=",",
                col.names=c("year", "my1", "my2"))
nrow(A)                                 # Count the rows in A

summary(A$year)                         # The column "year" in data frame A
                                        # is accessed as A$year

A$newcol <- A$my1 + A$my2               # Makes a new column in A
newvar <- A$my1 - A$my2                 # Makes a new R object "newvar"
A$my1 <- NULL                           # Removes the column "my1"

# You might find these useful, to "look around" a dataset --
str(A)
summary(A)
library(Hmisc)          # This requires that you've installed the Hmisc package
contents(A)
describe(A)

-------------------------- R steps to load library --------------------------------------

library ('entropy') # loads the library

??entropy # shows the help for the entropy package



Wednesday, 13 March 2013

Entropy calculation

http://tkhanson.net/cgit.cgi/misc.git/plain/entropy/Entropy.html

http://arxiv.org/pdf/0808.1771v2.pdf

http://vserver1.cscs.lsa.umich.edu/~crshalizi/notabene/cep-gzip.html

http://www.gzip.org/deflate.html

http://log.brandonthomson.com/2011/01/quick-python-gzip-vs-bz2-benchmark.html

http://www.gzip.org/deflate.html

Friday, 8 March 2013

Probability Theory Basics

In probability and statistics, a random variable or stochastic variable is a variable whose value is subject to variations due to chance (i.e. randomness, in a mathematical sense). As opposed to other mathematical variables, a random variable conceptually does not have a single, fixed value (even if unknown); rather, it can take on a set of possible different values, each with an associated probability.
Random variables can be classified as either discrete (i.e. it may assume any of a specified list of exact values) or as continuous (i.e. it may assume any numerical value in an interval or collection of intervals). The mathematical function describing the possible values of a random variable and their associated probabilities is known as a probability distribution.

A discrete probability distribution is a probability distribution characterized by a probability mass function. Thus, the distribution of a random variable X is discrete, and X is called a discrete random variable, if
\sum_u \Pr(X=u) = 1
as u runs through the set of all possible values of X. It follows that such a random variable can assume only a finite or countably infinite number of values.
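As a toy illustration (not from the original notes), the PMF of a fair six-sided die satisfies this normalization exactly:

```python
from fractions import Fraction

# PMF of a fair six-sided die: a discrete random variable X
pmf = {u: Fraction(1, 6) for u in range(1, 7)}

# The defining property of a discrete distribution: sum_u Pr(X = u) = 1
print(sum(pmf.values()))  # 1
```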

Intuitively, a continuous random variable is one which can take a continuous range of values — as opposed to a discrete distribution, where the set of possible values for the random variable is at most countable. While for a discrete distribution an event with probability zero is impossible (e.g. rolling 3½ on a standard die is impossible, and has probability zero), this is not so in the case of a continuous random variable. For example, if one measures the width of an oak leaf, the result of 3½ cm is possible; however, it has probability zero because there are uncountably many other potential values even between 3 cm and 4 cm. Each of these individual outcomes has probability zero, yet the probability that the outcome will fall into the interval (3 cm, 4 cm) is nonzero. This apparent paradox is resolved by the fact that the probability that X attains some value within an infinite set, such as an interval, cannot be found by naively adding the probabilities for individual values. Formally, each value has an infinitesimally small probability, which statistically is equivalent to zero.
Formally, if X is a continuous random variable, then it has a probability density function f(x), and therefore its probability of falling into a given interval, say [a, b], is given by the integral

    \Pr[a\le X\le b] = \int_a^b f(x) \, dx
In particular, the probability for X to take any single value a (that is a ≤ X ≤ a) is zero, because an integral with coinciding upper and lower limits is always equal to zero.

For a Random Variable, it is often enough to know what its "average value" is. This is captured by the mathematical concept of expected value of a random variable, denoted E[X], and also called the first moment. In general, E[f(X)] is not equal to f(E[X]). Once the "average value" is known, one could then ask how far from this average value the values of X typically are, a question that is answered by the variance and standard deviation of a random variable. E[X] can be viewed intuitively as an average obtained from an infinite population, the members of which are particular evaluations of X.
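A quick numeric illustration with a fair die (a toy example, not from the original notes): E[X] = 7/2, while E[X^2] = 91/6, which is not (E[X])^2 = 49/4, confirming that E[f(X)] is in general not f(E[X]):

```python
from fractions import Fraction

# PMF of a fair six-sided die
pmf = {u: Fraction(1, 6) for u in range(1, 7)}

E = sum(u * p for u, p in pmf.items())         # E[X] = 7/2
E_sq = sum(u * u * p for u, p in pmf.items())  # E[X^2] = 91/6, not (7/2)^2
var = E_sq - E * E                             # Var[X] = E[X^2] - E[X]^2

print(E)    # 7/2
print(var)  # 35/12
```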
---------------------------------

In probability theory, the normal (or Gaussian) distribution is a continuous probability distribution, defined by the formula

f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{ -\frac{(x-\mu)^2}{2\sigma^2} }.
The parameter μ in this formula is the mean or expectation of the distribution (and also its median and mode). The parameter σ is its standard deviation; its variance is therefore σ². A random variable with a Gaussian distribution is said to be normally distributed and is called a normal deviate.


Standard normal distribution

The simplest case of a normal distribution is known as the standard normal distribution, described by this probability density function:
\phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{- \frac{\scriptscriptstyle 1}{\scriptscriptstyle 2} x^2}.
The factor 1/\sqrt{2\pi} in this expression ensures that the total area under the curve φ(x) is equal to one. The 1/2 in the exponent ensures that the distribution has unit variance (and therefore also unit standard deviation). This function is symmetric around x = 0, where it attains its maximum value 1/\sqrt{2\pi}; and has inflection points at +1 and −1.




The cumulative distribution function (CDF) of a random variable is the probability of its value falling in the interval [-\infty, x], as a function of x. The CDF of the standard normal distribution, usually denoted with the capital Greek letter \Phi (phi), is the integral
\Phi(x)\; = \;\frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-t^2/2} \, dt
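The integral above has no closed form, but it can be evaluated via the error function, since Φ(x) = (1 + erf(x/√2))/2. A small illustration (the sample points 0 and 1.96 are just conventional values):

```python
import math

def std_normal_cdf(x):
    # Phi(x) = (1/sqrt(2*pi)) * integral_{-inf}^{x} exp(-t^2/2) dt,
    # expressed via the error function: Phi(x) = (1 + erf(x/sqrt(2))) / 2
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

print(round(std_normal_cdf(0.0), 4))   # 0.5: half the area lies left of the mean
print(round(std_normal_cdf(1.96), 4))  # about 0.975: the familiar 95% two-sided bound
```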


-----------------------------


In probability theory, the central limit theorem (CLT) states that, given certain conditions, the mean of a sufficiently large number of independent random variables, each with a well-defined mean and well-defined variance, will be approximately normally distributed.[1] The central limit theorem has a number of variants. In its common form, the random variables must be identically distributed. In variants, convergence of the mean to the normal distribution also occurs for non-identical distributions, given that they comply with certain conditions.

Let {X1, ..., Xn} be a random sample of size n — that is, a sequence of independent and identically distributed random variables drawn from distributions with expected value µ and finite variance σ². Suppose we are interested in the sample average
S_n := \frac{X_1+\cdots+X_n}{n}
of these random variables. By the law of large numbers, the sample averages converge in probability and almost surely to the expected value µ as n → ∞. The classical central limit theorem describes the size and the distributional form of the stochastic fluctuations around the deterministic number µ during this convergence. More precisely, it states that as n gets larger, the distribution of the difference between the sample average Sn and its limit µ, when multiplied by the factor √n (that is, √n(Sn − µ)), approximates the normal distribution with mean 0 and variance σ². For large enough n, the distribution of Sn is close to the normal distribution with mean µ and variance σ²/n. The usefulness of the theorem is that the distribution of √n(Sn − µ) approaches normality regardless of the shape of the distribution of the individual Xi’s. Formally, the theorem can be stated as follows:
Lindeberg–Lévy CLT. Suppose {X1, X2, ...} is a sequence of i.i.d. random variables with E[Xi] = µ and Var[Xi] = σ² < ∞. Then as n approaches infinity, the random variables √n(Sn − µ) converge in distribution to a normal N(0, σ²):[3]
\sqrt{n}\bigg(\bigg(\frac{1}{n}\sum_{i=1}^n X_i\bigg) - \mu\bigg)\ \xrightarrow{d}\ N(0,\;\sigma^2).
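An empirical check of the statement above (a toy simulation; the sample sizes are arbitrary). Averages of n uniform(0,1) draws, which have µ = 0.5 and σ² = 1/12, should cluster around µ with standard deviation close to σ/√n:

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is reproducible
n, trials = 200, 2000
means = [statistics.fmean(random.random() for _ in range(n))
         for _ in range(trials)]

print(round(statistics.fmean(means), 3))  # close to mu = 0.5
print(round(statistics.stdev(means), 3))  # close to sigma/sqrt(n) = 0.0204
```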

---------------------------------------------

In probability theory, a random variable is said to be stable (or to have a stable distribution) if it has the property that a linear combination of two independent copies of the variable has the same distribution, up to location and scale parameters.


Such distributions form a four-parameter family of continuous probability distributions parametrized by location and scale parameters μ and c, respectively, and two shape parameters β and α, roughly corresponding to measures of asymmetry and concentration, respectively (see the figures).

Wednesday, 6 March 2013

Using memcache with Java



Memcache Overview

- A distributed cache
- Store the key/value pairs in the cache. The value needs to be serializable since the cache is distributed.
- You can use this to cache Database objects as well as the generated HTML page markups from the servlets.
- The nodes of the cache cluster don't know about each other and don't interact with each other
- The client, however, knows about all the nodes in the memcache cluster
- The client hashes the string key and maps the hash value to one of the servers. The get() or set() requests for that key go to the selected node of the server.
- The client sends the set() request asynchronously to the server using a daemon thread. -- ???
- This is good for web applications which need to scale massively.
- You can use the TELNET mode to connect and issue commands to the memcache cluster for debugging purpose.
- The spymemcached client lib for Java integrates easily with Hibernate.
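A minimal sketch of the client-side key-to-node mapping described above. The node addresses are made up, and plain modulo hashing is used only to illustrate that the client, not the servers, picks the node for each key; real clients such as spymemcached use consistent hashing so that adding a node does not remap every key:

```python
import hashlib

# Hypothetical memcache cluster nodes (the client knows all of them)
nodes = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]

def node_for(key):
    # Hash the string key, then map the hash value onto one of the nodes.
    h = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return nodes[h % len(nodes)]

# All get()/set() calls for a given key go to the same node
print(node_for("user:42"))
print(node_for("page:/index.html"))
```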


http://www.javaworld.com/javaworld/jw-05-2012/120515-memcached-for-java-enterprise-performance-2.html?page=4


Sunday, 17 February 2013

Understanding open source architectures I

https://github.com/foursquare
https://code.google.com/p/foursquared/wiki/Design
https://code.google.com/p/foursquared/source/browse
https://sites.google.com/site/foursquareddev/ux/venueactivity

https://github.com/foursquare/Android-PullToRefresh
https://github.com/foursquare/SlidingMenu

Eclipse:
http://www.aosabook.org/en/eclipse.html

Graphite:
http://www.aosabook.org/en/graphite.html

http://www.tornadoweb.org/

http://blog.reddit.com/2008/06/reddit-goes-open-source.html

https://code.google.com/p/guava-libraries/wiki/GuavaExplained



Apache Mahout

Collaborative filtering (CF) is a technique used by some recommender systems. Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B's opinion on a different issue x than to have the opinion on x of a person chosen randomly. 

http://en.wikipedia.org/wiki/Collaborative_filtering

Currently Mahout supports mainly four use cases: 

  1. Recommendation mining takes users' behavior and from that tries to find items users might like. 
  2. Clustering takes e.g. text documents and groups them into groups of topically related documents. 
  3. Classification learns from existing categorized documents what documents of a specific category look like and is able to assign unlabelled documents to the (hopefully) correct category. 
  4. Frequent itemset mining takes a set of item groups (terms in a query session, shopping cart content) and identifies which individual items usually appear together.

Best command line tools for Linux, MAC

Building a search engine based on Inverted indexes

Monday, 4 February 2013

Using multi-dimensional associative arrays in AWK

Examples:

cat /tmp/33 | awk -F'[\\^\\[\\]]' '{c[$5,$6]+=1}END{for( i in c) {split(i,sep,SUBSEP); print sep[1],sep[2], c[i] ;}}' 
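For comparison, the same group-and-count in Python is a dict keyed by a tuple, which is exactly what AWK's c[$5,$6] emulates (internally the indices are joined with SUBSEP). The sample records below are made up:

```python
from collections import Counter

# Count occurrences of each (field5, field6) pair, like c[$5,$6] += 1 in AWK
records = ["a b c d X Y", "a b c d X Z", "a b c d X Y"]  # stand-in input lines
counts = Counter()
for line in records:
    fields = line.split()
    counts[(fields[4], fields[5])] += 1

for (f5, f6), n in counts.items():
    print("%s %s %d" % (f5, f6, n))
```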

Reading:

Sunday, 3 February 2013

Difference between ClassNotFoundException and NoClassDefFoundError

http://stackoverflow.com/questions/1457863/what-is-the-difference-between-noclassdeffounderror-and-classnotfoundexception


The difference, from the Java API specifications, is as follows.
ClassNotFoundException is thrown when an application tries to load a class through its string name using:
  • The forName method in class Class.
  • The findSystemClass method in class ClassLoader.
  • The loadClass method in class ClassLoader.
but no definition for the class with the specified name could be found.
NoClassDefFoundError is thrown if the Java Virtual Machine or a ClassLoader instance tries to load the definition of a class (as part of a normal method call or as part of creating a new instance using the new expression) and no definition of the class could be found.
The searched-for class definition existed when the currently executing class was compiled, but the definition can no longer be found.
So, it appears that the NoClassDefFoundError occurs when the source was successfully compiled, but at runtime, the required class files were not found. This may be something that can happen in the distribution or production of JAR files, where not all the required class files were included.
As for ClassNotFoundException, it appears that it may stem from trying to make reflective calls to classes at runtime, where the classes the program is trying to load do not exist.
The difference between the two is that one is an Error and the other is an Exception. NoClassDefFoundError is an Error, and it arises from the Java Virtual Machine having problems finding a class it expected to find: a program that worked at compile-time can't run because the class files are not found, or are not the same as those present at compile-time. This is a pretty critical error, as the program cannot be started by the JVM.
On the other hand, ClassNotFoundException is an Exception, so it is somewhat expected and is recoverable. Using reflection can be error-prone, since there is no compile-time check that all the required classes exist, so any problems with finding the desired classes appear only at runtime.



Thursday, 31 January 2013

How to find the process which has the file handle open ?

A lot of times you need to find which Process (PID) has the handle to a particular file.
In Linux this is easy, using the following command:

$ ls -ld /proc/*/fd/* | grep <file-name-filter>

Enjoy !!

Monday, 28 January 2013

Bloom filters and Count min Sketching Data structures

Another Good read:
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/

Original Post: http://lkozma.net/blog/sketching-data-structures/#comment-85892


“Sketching” data structures store a summary of a data set in situations where the whole data would be prohibitively costly to store (at least in a fast-access place like the memory as opposed to the hard disk). Variants of trees, hash tables, etc. are not sketching structures, they just facilitate access to the data, but they still store the data itself. However, the concept of hashing is closely related to most sketching ideas as we will see.
The main feature of sketching data structures is that they can answer certain questions about the data extremely efficiently, at the price of the occasional error. The best part is that the probability of an error can be quantified and the programmer can trade off the expected error rate with the amount of resources (storage, time) afforded. At the limit of this trade-off (when no error is allowed) these sketching structures collapse into traditional data structures.
Sketching data structures are somewhat counter-intuitive, but they can be useful in many real applications. I look at two such structures mostly for my own benefit: As I try to understand them, I write down my notes. Perhaps someone else will find them useful. Links to further information can be found in the end. Leave comments if you know of other sketching data structures that you found useful or if you have some favorite elegant and unusual data structure.

1. Bloom filter

Suppose we have to store a set of values (A) that come from a “universe” of possible values (U). Examples: IP addresses, words, names of people, etc. Then we need to check whether a new item x is a member of set A or not. For example, we might check if a word is spelled correctly by looking it up in the dictionary, or we can verify whether an IP address is banned by looking it up in our black list.
We could achieve this by storing the whole set in our favorite data structure. Alternatively, we could just store a binary array, with one bit for each possible element in U. For example, to quickly check if a number is prime or not, we could precompute an array of bits for all numbers from 0 to the maximum value we need:
Prime = 001101010001010001010001...
To check whether a number is prime, we look at the corresponding bit and we are done. This is a dummy example, but it is already obvious that in most cases the range of possible values is too large to make this practical. The number of all possible strings of length 5, containing just letters from the English alphabet is 26^5 = 11,881,376 and in most real problems the universe U is much larger than that.
The magic of the Bloom filter allows us to get away with much less storage at the price of an occasional mistake. This mistake can only be a false positive, the Bloom filter might say that x is in A when in fact it is not. On the other hand, when it says that x is not in A, this is always true, in other words false negatives are impossible. In some applications (like the spell-checker), this is acceptable if false positives are not too frequent. In other applications (like the IP blacklist), misses are more common and in the case of a hit, we can verify the answer by reading the actual data from the more costly storage. In this case the Bloom filter can act as an efficiency layer in front of a more costly storage structure. If false positives can be tolerated, the Bloom filter can be used by itself.
The way it works is really simple: we use a binary array of size n, as in the prime numbers example, that is initialized with 0s. In this case however, n is much smaller than the total number of elements in U. For each element to be added to A, we compute k different hash values (using k independent hash functions) with results between 1 and n. We set all these locations h1, h2, …, hk (the indexes returned by the hash functions) in the binary array to 1. To check if y is in A, we compute the hash values h1(y), …, hk(y) and check the corresponding locations in the array. If at least one of them is 0, the element is missing. If all fields are 1, we can say that the element is present with a certainty that depends on n (the size of the array), k (the number of hashes) and the number of elements inserted. Note that n and k can be set beforehand by the programmer.
The source of this uncertainty is that hash values can collide. This becomes more of a problem as the array is filling up. If the array were full, the answer to all queries would be a yes. In this simple variant, deleting an element is not possible: we cannot just set the corresponding fields to 0, as this might interfere with other elements that were stored. There are many variants of Bloom filters, some allowing deletion and some allowing the storage of a few bits of data as well. For these and for some rigorous analysis, as well as some implementation tricks, see the links below.
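The scheme just described fits in a few lines of Python. This is a toy sketch: salted md5 digests stand in for the k independent hash functions, and real implementations would derive n and k from the desired false-positive rate:

```python
import hashlib

class BloomFilter(object):
    """Toy Bloom filter: n bits, k hash functions derived from salted md5."""
    def __init__(self, n=1000, k=3):
        self.n, self.k = n, k
        self.bits = [0] * n

    def _hashes(self, item):
        for i in range(self.k):
            h = hashlib.md5(("%d:%s" % (i, item)).encode("utf-8")).hexdigest()
            yield int(h, 16) % self.n

    def add(self, item):
        for pos in self._hashes(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # All k bits set => "probably present"; any 0 bit => definitely absent.
        return all(self.bits[pos] for pos in self._hashes(item))

bf = BloomFilter()
bf.add("sally")
bf.add("jane")
print("sally" in bf)  # True: no false negatives, ever
print("bob" in bf)    # almost certainly False, but a false positive is possible
```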
A quick dummy example is a name database. Suppose we want to store female names and reject male names. We use two hash functions that return a number from 1 to 10 for any string.
Initial configuration: 0000000000
Insert("Sally")      : 0100000001
   # h1("Sally") = 2, h2("Sally") = 10
Insert("Jane")       : 1110000001
   # h1("Jane") = 1, h2("Jane") = 3
Insert("Mary")       : 1110100001
   # h1("Mary") = 5, h2("Mary") = 2 [collision]

Query("Sally")
   # bits 2 and 10 are set,
   # return HIT
Query("John")
   # h1("John") = 10 set, but h2("John") = 4 not set
   # return MISS
Query("Bob")
   # h1("Bob") = 5 set, h2("Bob") = 1 set
   # return HIT (false positive)
The original paper (Bloom, 1970)

2. Count-Min sketch

The Count-Min (CM) sketch is less known than the Bloom filter, but it is somewhat similar (especially to the counting variants of the Bloom filter). The problem here is to store a numerical value associated with each element, say the number of occurrences of the element in a stream (for example when counting accesses from different IP addresses to a server). Surprisingly, this can be done using less space than the number of elements, with the trade-off that the result can be slightly off sometimes, but mostly on the small values. Again, the parameters of the data structure can be chosen such as to obtain a desired accuracy.
CM works as follows: we have k different hash functions and k different tables which are indexed by the outputs of these functions (note that the Bloom filter can be implemented in this way as well). The fields in the tables are now integer values. Initially we have all fields set to 0 (all unseen elements have count 0). When we increase the count of an element, we increment all the corresponding k fields in the different tables (given by the hash values of the element). If a decrease operation is allowed (which makes things more difficult), we similarly subtract a value from all k fields.
To obtain the count of an element, we take the minimum of the k fields that correspond to that element (as given by the hashes). This makes intuitive sense. Out of the k values, probably some have been incremented on other elements also (if there were collisions on the hash values). However, if at least one of the k fields has not been touched by any other element, the minimum gives the correct value. See illustration for an example on counting hits from IP addresses:
In this example the scenario could be that we want to notice if an IP address is responsible for a lot of traffic (to further investigate if there is a problem or some kind of attack). The CM structure allows us to do this without storing a record for each address. When we increment the fields corresponding to an address, simultaneously we check if the minimum is above some threshold and we do some costly operation if it is (which might be a false alert). On the other hand, the real count can never be larger than the reported number, so if the minimum is a small number, we don’t have to do anything (this holds for the presented simple variant that does not allow decreases). As the example shows, CM sketch is most useful for detecting “heavy hitters” in a stream.
It is interesting to note that if we take the CM data structure and make the counters such that they saturate at 1, we obtain the Bloom filter.
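A toy sketch of the structure described above, reusing salted md5 digests as the k hash functions (an assumption for illustration; real implementations pick the width w and depth k from the desired error bound). The IP addresses are made up:

```python
import hashlib

class CountMinSketch(object):
    """Toy Count-Min sketch: k rows of w counters, one hash function per row."""
    def __init__(self, w=100, k=4):
        self.w, self.k = w, k
        self.tables = [[0] * w for _ in range(k)]

    def _hashes(self, item):
        for i in range(self.k):
            h = hashlib.md5(("%d:%s" % (i, item)).encode("utf-8")).hexdigest()
            yield int(h, 16) % self.w

    def increment(self, item, by=1):
        for row, pos in enumerate(self._hashes(item)):
            self.tables[row][pos] += by

    def count(self, item):
        # Minimum over the k rows: may over-count (collisions), never under-counts.
        return min(self.tables[row][pos]
                   for row, pos in enumerate(self._hashes(item)))

cm = CountMinSketch()
for ip in ["1.2.3.4"] * 50 + ["5.6.7.8"] * 3:
    cm.increment(ip)
print(cm.count("1.2.3.4"))  # 50, unless the other key collides in every row
```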
For further study, analysis of the data structure and variants, proper choice of parameters, see the following links:
Original paper (Cormode, Muthukrishnan, 2005)
Lecture slides (Helsinki Univ. of Tech)
What is your favorite counter-intuitive data structure or algorithm?