Stuff I do: March 2013

Thursday 21 March 2013

Print to dynamic files from AWK cmd

To send AWK output to files whose names are built from fields of each record:

head 236_236_raw | awk -F';' '{ print $7,"\t",$6 > ("_"$2"_"$3".txt")}'

If the number of open files is large, you will get the following error:

awk: * makes too many open files
 input record number 203, file
 source line number 1

To overcome the above problem, you can close the files like this:

head 236_236_raw | awk -F';' '{ print $7,"\t",$6 >> ("_"$2"_"$3".txt"); close("_"$2"_"$3".txt")}'

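The important point to note is that the redirection operator changed from ">" to ">>". After close(), the next write to the same name reopens the file, and ">" would truncate it on every reopen, so only the last record would survive; ">>" appends instead.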
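For comparison, here is a rough Python sketch of the same trick. The function name `split_by_fields` and the `out_dir` argument are made up for this note; the field positions mirror the awk command (fields 7 and 6 are written, fields 2 and 3 form the file name). Skipping short records is my own assumption; awk would print empty fields instead.

```python
import os

def split_by_fields(lines, out_dir="."):
    """Append fields 7 and 6 of each ';'-separated line to _<f2>_<f3>.txt."""
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) < 7:
            continue  # assumption: ignore records that are too short
        path = os.path.join(out_dir, "_%s_%s.txt" % (fields[1], fields[2]))
        # Append mode plus an immediate close plays the role of ">>" and
        # close() in awk: only one output file is open at any time.
        with open(path, "a") as out:
            out.write("%s \t %s\n" % (fields[6], fields[5]))
```

Calling it with an open file object, e.g. `split_by_fields(open("236_236_raw"))`, should behave like the awk one-liner without the `head`.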

Tuesday 19 March 2013

Getting started with Facebook App development on Heroku environment

I am currently setting up a work environment for Facebook app development on Heroku, using Python as the development language.

The complete tutorial is at: https://devcenter.heroku.com/articles/facebook

At a high level, the steps are:

- Create a new Facebook app and check the option for Heroku hosting.
- This creates a new domain on Heroku for your app and installs your app there.
- Now you can 'git clone' the code to your local machine, make changes, and 'git push heroku master' to push the changes back to the world.
- Set up the local environment for testing:

This involves setting up virtualenv for Python: http://www.virtualenv.org/en/latest/

virtualenv is a tool to create isolated Python environments. The basic problem being addressed is one of dependencies and versions, and indirectly permissions. Imagine you have an application that needs version 1 of LibFoo, but another application requires version 2. How can you use both these applications? If you install everything into /usr/lib/python2.7/site-packages (or whatever your platform’s standard location is), it’s easy to end up in a situation where you unintentionally upgrade an application that shouldn’t be upgraded.

Or more generally, what if you want to install an application and leave it be? If an application works, any change in its libraries or the versions of those libraries can break the application. Also, what if you can’t install packages into the global site-packages directory? For instance, on a shared host.

In all these cases, virtualenv can help you. It creates an environment that has its own installation directories, that doesn’t share libraries with other virtualenv environments (and optionally doesn’t access the globally installed libraries either).

-----------------------------
Using Virtualenv:
http://stackoverflow.com/questions/10763440/how-to-install-python3-version-of-package-via-pip

virtualenv -p /usr/bin/python3 py3env
source py3env/bin/activate
pip install package-name

---------------------------------
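A quick way to confirm from inside Python that the activated environment is really the one in use. This is only a sketch: the helper name `in_virtualenv` is mine, and the check relies on the interpreter exposing `sys.real_prefix` (classic virtualenv) or a `sys.base_prefix` that differs from `sys.prefix` (venv and newer virtualenv).

```python
import sys

def in_virtualenv():
    """Return True when the interpreter appears to run inside a virtual environment."""
    # Classic virtualenv sets sys.real_prefix; venv and newer virtualenv
    # make sys.prefix differ from sys.base_prefix.
    base = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base

# Show where packages will be installed and whether we look isolated.
print(sys.prefix, in_virtualenv())
```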

Now test your changes locally before deploying to the server.

Next steps:

- https://devcenter.heroku.com/articles/python

- To be able to use Django here - https://devcenter.heroku.com/articles/django

- To be able to use a DB backend.

Monday 18 March 2013

R, Getting Started

Download R from:
http://www.r-project.org/

Download the RStudio IDE, which has awesome features like auto-completion:

http://www.rstudio.com/

Installing a package in R (installed entropy package):
http://math.usask.ca/~longhai/software/installrpkg.html

http://math.usask.ca/~longhai/doc/others/R-tutorial.pdf

R Examples:

http://www.rexamples.com/

http://www.mayin.org/ajayshah/KB/R/index.html

-------------------------- R code to read a file into R variables -------------------

# Goal: To read in a simple data file, and look around its contents.

# Suppose you have a file "x.data" which looks like this:
#        1997,3.1,4
#        1998,7.2,19
#        1999,1.7,2
#        2000,1.1,13
# To read it in --

A <- read.table("x.data", sep=",",
                col.names=c("year", "my1", "my2"))
nrow(A)                                 # Count the rows in A

summary(A$year)                         # The column "year" in data frame A
                                        # is accessed as A$year

A$newcol <- A$my1 + A$my2               # Makes a new column in A
newvar <- A$my1 - A$my2                 # Makes a new R object "newvar"
A$my1 <- NULL                           # Removes the column "my1"

# You might find these useful, to "look around" a dataset --
str(A)
summary(A)
library(Hmisc)          # This requires that you've installed the Hmisc package
contents(A)
describe(A)

-------------------------- R steps to load library --------------------------------------

library ('entropy') # loads the library

??entropy # searches the help pages for "entropy"

Wednesday 13 March 2013

Entropy calculation

http://tkhanson.net/cgit.cgi/misc.git/plain/entropy/Entropy.html

http://arxiv.org/pdf/0808.1771v2.pdf

http://vserver1.cscs.lsa.umich.edu/~crshalizi/notabene/cep-gzip.html

http://www.gzip.org/deflate.html

http://log.brandonthomson.com/2011/01/quick-python-gzip-vs-bz2-benchmark.html

Friday 8 March 2013

Probability Theory Basics

In probability and statistics, a random variable or stochastic variable is a variable whose value is subject to variations due to chance (i.e. randomness, in a mathematical sense). As opposed to other mathematical variables, a random variable conceptually does not have a single, fixed value (even if unknown); rather, it can take on a set of possible different values, each with an associated probability.

Random variables can be classified as either discrete (i.e. it may assume any of a specified list of exact values) or as continuous (i.e. it may assume any numerical value in an interval or collection of intervals). The mathematical function describing the possible values of a random variable and their associated probabilities is known as a probability distribution.

A discrete probability distribution shall be understood as a probability distribution characterized by a probability mass function. Thus, the distribution of a random variable X is discrete, and X is then called a discrete random variable, if

$\sum_u \Pr(X=u) = 1$

as u runs through the set of all possible values of X. It follows that such a random variable can assume only a finite or countably infinite number of values.

Intuitively, a continuous random variable is one that can take a continuous range of values, as opposed to a discrete distribution, where the set of possible values for the random variable is at most countable. While for a discrete distribution an event with probability zero is impossible (e.g. rolling 3½ on a standard die is impossible, and has probability zero), this is not so in the case of a continuous random variable. For example, if one measures the width of an oak leaf, the result of 3½ cm is possible; however, it has probability zero because there are uncountably many other potential values even between 3 cm and 4 cm. Each of these individual outcomes has probability zero, yet the probability that the outcome will fall into the interval (3 cm, 4 cm) is nonzero. This apparent paradox is resolved by the fact that the probability that X attains some value within an infinite set, such as an interval, cannot be found by naively adding the probabilities for individual values. Formally, each value has an infinitesimally small probability, which statistically is equivalent to zero.

Formally, if X is a continuous random variable, then it has a probability density function ƒ(x), and therefore its probability of falling into a given interval, say [a, b] is given by the integral

$\Pr[a\le X\le b] = \int_a^b f(x) \, dx$

In particular, the probability for X to take any single value a (that is a ≤ X ≤ a) is zero, because an integral with coinciding upper and lower limits is always equal to zero.

For a random variable, it is often enough to know what its "average value" is. This is captured by the mathematical concept of the expected value of a random variable, denoted E[X], and also called the first moment. In general, E[f(X)] is not equal to f(E[X]). Once the "average value" is known, one could then ask how far from this average value the values of X typically are, a question that is answered by the variance and standard deviation of a random variable. E[X] can be viewed intuitively as an average obtained from an infinite population, the members of which are particular evaluations of X.
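A small worked example, using a fair six-sided die as the discrete random variable (the die and the helper name `expectation` are my own choices, not from the text above). It checks that the probabilities sum to one, and shows that E[f(X)] and f(E[X]) differ for f(x) = x².

```python
from fractions import Fraction

# Fair six-sided die: each face has probability 1/6.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}
total = sum(pmf.values())  # a probability mass function sums to one

def expectation(pmf, f=lambda x: x):
    """E[f(X)] for a discrete random variable given as {value: probability}."""
    return sum(f(x) * p for x, p in pmf.items())

mean = expectation(pmf)                       # E[X] = 7/2
mean_sq = expectation(pmf, lambda x: x * x)   # E[X^2] = 91/6
variance = mean_sq - mean ** 2                # Var[X] = E[X^2] - (E[X])^2 = 35/12

print(total, mean, mean_sq, mean ** 2, variance)
```

Here E[X²] = 91/6 ≈ 15.17 while (E[X])² = 49/4 = 12.25, and the gap between them is exactly the variance.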

---------------------------------

In probability theory, the normal (or Gaussian) distribution is a continuous probability distribution, defined by the formula

$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{ -\frac{(x-\mu)^2}{2\sigma^2} }.$

The parameter μ in this formula is the mean or expectation of the distribution (and also its median and mode). The parameter σ is its standard deviation; its variance is therefore σ². A random variable with a Gaussian distribution is said to be normally distributed and is called a normal deviate.

Standard normal distribution

The simplest case of a normal distribution is known as the standard normal distribution, described by this probability density function:

$\phi(x) = \frac{1}{\sqrt{2\pi}}\, e^{- \frac{\scriptscriptstyle 1}{\scriptscriptstyle 2} x^2}.$

The factor $\scriptstyle\ 1/\sqrt{2\pi}$ in this expression ensures that the total area under the curve ϕ(x) is equal to one. The 1/2 in the exponent ensures that the distribution has unit variance (and therefore also unit standard deviation). This function is symmetric around x=0, where it attains its maximum value $1/\sqrt{2\pi}$, and has inflection points at +1 and −1.

The cumulative distribution function (CDF) of a random variable is the probability of its value falling in the interval $(-\infty, x]$, as a function of x. The CDF of the standard normal distribution, usually denoted with the capital Greek letter $\Phi$ (phi), is the integral

$\Phi(x)\; = \;\frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-t^2/2} \, dt$
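To make the density and the CDF concrete, here is a minimal sketch that computes Φ through the error function (using the standard identity Φ(x) = ½(1 + erf(x/√2))) and compares it with a direct numerical integral of ϕ. The function names and the midpoint-rule integrator are my own; nothing here comes from a particular library beyond Python's `math` module.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def integrate(f, a, b, steps=10000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

p_exact = Phi(1.0) - Phi(-1.0)            # Pr[-1 <= X <= 1], about 0.6827
p_numeric = integrate(phi, -1.0, 1.0)     # same probability from the density
p_point = integrate(phi, 2.0, 2.0)        # a single value has probability 0

print(p_exact, p_numeric, p_point)
```

The two interval probabilities agree to many decimal places, and the integral with coinciding limits comes out as zero, as described above.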

-----------------------------

In probability theory, the central limit theorem (CLT) states that, given certain conditions, the mean of a sufficiently large number of independent random variables, each with a well-defined mean and well-defined variance, will be approximately normally distributed.^[1] The central limit theorem has a number of variants. In its common form, the random variables must be identically distributed. In variants, convergence of the mean to the normal distribution also occurs for non-identical distributions, given that they comply with certain conditions.

Let {X₁, ..., X_n} be a random sample of size n, that is, a sequence of independent and identically distributed random variables drawn from a distribution with expected value µ and finite variance σ². Suppose we are interested in the sample average

$S_n := \frac{X_1+\cdots+X_n}{n}$

of these random variables. By the law of large numbers, the sample averages converge in probability and almost surely to the expected value µ as n → ∞. The classical central limit theorem describes the size and the distributional form of the stochastic fluctuations around the deterministic number µ during this convergence. More precisely, it states that as n gets larger, the distribution of the difference between the sample average S_n and its limit µ, when multiplied by the factor √n (that is √n(S_n − µ)), approximates the normal distribution with mean 0 and variance σ². For large enough n, the distribution of S_n is close to the normal distribution with mean µ and variance σ²/n. The usefulness of the theorem is that the distribution of √n(S_n − µ) approaches normality regardless of the shape of the distribution of the individual X_i’s. Formally, the theorem can be stated as follows:

Lindeberg–Lévy CLT. Suppose {X₁, X₂, ...} is a sequence of i.i.d. random variables with E[X_i] = µ and Var[X_i] = σ² < ∞. Then as n approaches infinity, the random variables √n(S_n − µ) converge in distribution to a normal N(0, σ²):^[3]

$\sqrt{n}\bigg(\bigg(\frac{1}{n}\sum_{i=1}^n X_i\bigg) - \mu\bigg)\ \xrightarrow{d}\ N(0,\;\sigma^2).$
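A quick simulation to see the theorem at work. This is a sketch under my own assumptions: the X_i are Uniform(0, 1), so µ = 1/2 and σ² = 1/12, and the sample size, trial count and seed are arbitrary. If the CLT holds, the values of √n(S_n − µ) should have mean near 0, variance near 1/12, and roughly 68% of them should fall within one σ of zero.

```python
import math
import random
import statistics

def scaled_deviations(n, trials, rng):
    """Draw `trials` values of sqrt(n) * (S_n - mu) for Uniform(0, 1) samples."""
    mu = 0.5
    out = []
    for _ in range(trials):
        s_n = sum(rng.random() for _ in range(n)) / n  # sample average S_n
        out.append(math.sqrt(n) * (s_n - mu))
    return out

rng = random.Random(2013)  # fixed seed so the run is repeatable
devs = scaled_deviations(n=50, trials=20000, rng=rng)

sigma = math.sqrt(1.0 / 12.0)
sample_mean = statistics.mean(devs)
sample_var = statistics.pvariance(devs)
within_one_sigma = sum(abs(d) <= sigma for d in devs) / len(devs)

print(sample_mean, sample_var, within_one_sigma)
```

The uniform distribution looks nothing like a bell curve, yet the scaled averages behave like draws from N(0, 1/12).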

---------------------------------------------

In probability theory, a random variable is said to be stable (or to have a stable distribution) if it has the property that a linear combination of two independent copies of the variable has the same distribution, up to location and scale parameters.

Such distributions form a four-parameter family of continuous probability distributions parametrized by location and scale parameters μ and c, respectively, and two shape parameters β and α, roughly corresponding to measures of asymmetry and concentration, respectively.

Wednesday 6 March 2013

Using memcache with Java

Memcache Overview

- A distributed cache
- Store the key/value pairs in the cache. The value needs to be serializable since the cache is distributed.
- You can use it to cache database objects as well as the HTML markup generated by servlets.
- The nodes of the cache cluster don't know about each other and don't interact with each other.
- The client, however, knows about all the nodes in the memcache cluster.
- The client hashes the string key and maps the hash value to one of the servers. The get() or set() request for that key goes to the selected node.
- The client sends the set() request asynchronously to the server using a daemon thread. -- ???
- This is good for web applications which need to scale massively.
- You can use telnet to connect and issue commands to the memcache cluster for debugging.
- The spymemcached client lib for Java integrates easily with Hibernate.
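The key-to-server mapping described in the list above can be sketched in a few lines. This is only an illustration of the idea, not how spymemcached or any other real client does it: the function name `pick_server`, the MD5-plus-modulo scheme and the node addresses are all assumptions of mine, and real clients may use other hash functions or consistent hashing.

```python
import hashlib

def pick_server(key, servers):
    """Map a string key to one server by hashing the key (simple modulo scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]

# Hypothetical node addresses, for illustration only.
servers = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]

node = pick_server("user:42", servers)
# The same key always lands on the same node, so get() finds what set() stored.
same_node = pick_server("user:42", servers)
print(node, node == same_node)
```

Because the client alone decides the node, the servers never need to talk to each other. The catch with plain modulo is that adding or removing a server remaps most keys, which is the problem consistent hashing tries to reduce.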

http://www.javaworld.com/javaworld/jw-05-2012/120515-memcached-for-java-enterprise-performance-2.html?page=4