User:Prasenjitmukherjee

tPrasen Mukherjee[edit]

I live in Hyderabad and work for Bing!. I am interested in data mining, mathematics, physics, religion and lots of other random topics.

Resume[edit]

Google Doc Link to my resume

What do I do[edit]

Clustering in Mobile Ad Network Space
- Extended Fielder Method ( Based on SVD decomposition )
- Stacked RBMs ( aka Restricted Boltzmann Machine )

Earlier Work[edit]

Architecting Mapquest/Mobile Platform Solutions[edit]

Mapquest
- Single Line Geo Search
  - Extracting contextual location based on short term browser behavior
- Mood based route recommendation
- WebTileCache

Mobile AIM Platform
- Wireless Village/IMPS

Open Mobile SDK effort (http://openmobileplt.sourceforge.net/)
- AirMedia Platform

Spearheading R&D initiatives[edit]

Leading R&D initiative by collaborating with premier research inistitutes
Leading open source Machine Learning initiatives (http://lucene.apache.org/mahout)

Ad monetization (Mobile/Social-Networking)[edit]

Interact/Influence different business units to aid revenue-generation and monetization
Apply technology to gauge trends and user-behaviour
- Spot market opportunities

Thoughts on Personalized Recommended Systems[edit]

Recommendation System ( common for Mobile/Desktop)[edit]

Recommend (news) articles
Recommend upcoming events
Personalized browsing - Idea here is to learn the immediate intent of user dynamically ( in the current session as he broswes more and more pages ) from his past history, and recommend articles based on that.

Temporal profiling of user log data[edit]

Analyze distribution of personal topics ( probably extracted via PLSI,LDA etc.) over time
Use the above distribution data to group temporally similar users
Reference papers :
- Andrew McCallum's Publications
- A Continuous-Time Model of Topic Co-occurrence Trends. Xuerui Wang, Wei Li, and Andrew McCallum. AAAI Workshop on Event Detection, 2006. Capture the time distributions not only of a topics, but also of their co-occurrences. For example, notice that while NLP and ML have both been around for a long time, but their co-occurrence has been rising recently. The model is effectively a combination of the Pachinko Allocation Model (PAM) and Topics-Over-Time (TOT).
- Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends. Xuerui Wang and Andrew McCallum. Conference on Knowledge Discovery and Data Mining (KDD) 2006. A new LDA-style topic model that models trends over time. The meaning of a topic remains fixed and reliable, but its prevalence over time is captured, and topics may thus focus in on co-occurrence patterns that are time-sensitive. Unlike other work that relies on Markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps. Improvements in topic saliency and the ability to predict time given words.

Personalized Summarization[edit]

Can be applied to mobile devices
Perform document summarization ( snippet-generation ) based on user profile.

Describe a user based on his past behavior. This requires NLP techniques.

Use Session data to aid unsupervised clustering[edit]

Each session could be thought as a representative of a user-intent ( aka topic ) as generally ( there are always exceptions) we do related things in a session. Like searching for a travel destination, followed by searching for hotel rates, followed by searching for air-tickets. If we make this assumption, this greatly reduces the problem of high dimensions.

Now instead of clustering based on document/word tuple, we could create a representative vector for each user session ( call it a session-vector in term-space ) and apply clustering logic on session/word tuple. We can use LSH to see the distribution of session-vectors across different hash-buckets, which can be used to estimate the aprior number of clusters for latent variable algorithms like PLSI/LDA etc.

ML Algo resources[edit]

Generic resources[edit]

Andrew Ng's ML Channel

generative Vs discriminative[edit]

Elephant drawing example for Discriminative vs Generative by Chalearn Acad

Logistic Regression[edit]

SVM[edit]

One of the best explanations of linear SVM
- Patrick H Winston Lecture 16 : Learning Support Vector Machines. MIT OpenCourseware 6.034 Fall 2010 Artificial Intelligence
- Anoop M Namboodiri IIIT Hyd prof, VC Dimension
  - @ 47:15 Use standard convex QP solvers to get lagrange multipliers and then use the equations to get w and b which defines the decision boundary.
- Part 2 of above, explaining kernel trick
PPTs
- Jinwei Gu An Introduction to SVM & the kernel trick
Issues
- Choice of kernel
  - Gaussian or polynomial kernel is default
  - if ineffective, more elaborate kernels are needed
  - domain experts can give assistance in formulating appropriate similarity measures
- Choice of kernel parameters
  - e.g. σ in Gaussian kernel
  - σ is the distance between closest points with different classifications
  - In the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters.
- Optimization criterion – Hard margin v.s. Soft margin
  - a lengthy series of experiments in which various parameters are tested

Understanding the essence of probability distributions[edit]

Understanding Maximum Likelihood[edit]

Its the basis of probability theory, we use it everyday without realizing it. How do we compute probability : #-possible-outcomes/#-total-outcomes. This intuition/axiom/theorem/rule is based on max likelihood. Lets say we have a Bernoulli's event with n-trials and k-positive outcomes. We say probability of positive outcome is k/n. This comes from max-likelihood. Lets see it how.

Probability of data/likelihood ( L ) = p^k * (1-p)^(n-k)
Which value of p will maximize L. That is given by dL/dp = 0.
Solving dL/dp=0 will give same results as solving d(logL)/dp=0.
Solving d(logL)/dp=0 ==> p = k/n Q.E.D.

Bernoulli type distributions[edit]

Relation between Bernoulli/Binomial/Beta
- Binomial is a special case of Bernoulli.
- When n(#-of-trials) = 1, the binomial distribution is a Bernoulli distribution.
- Beta Distribution is continuous extension of Binomial Distribution
- Beta Distribution is also the conjugate prior of Binomial and Bernoulli Distribution in Bayesian statistics.

Multivariate generalizations of above statements[edit]

Relation between Multinomial/Dirichlet
- Multinomial distribution is a multivariate generalization of the binomial distribution.
- When k(#-of-outcomes) = 2, the multinomial distribution is the binomial distribution.
- Dirichlet distribution is the conjugate prior of the multinomial in Bayesian statistics.
- Dirichlet distribution is the multivariate generalization of the Beta distribution.

Other important relations between distributions[edit]

The binomial distribution converges towards the Poisson distribution as the number of trials goes to infinity while the product np remains fixed.
The Gamma function generalizes the factorial function for non-integer and complex values of n. If n is a positive integer, then

\Gamma (n)=(n-1)!\,

showing the connection to the factorial function.

Just as the gamma function for integers describes factorials, the beta function can define a binomial coefficient after adjusting indices:

{n \choose k}={\frac {1}{(n+1)\mathrm {B} (n-k+1,k+1)}}

Why chose Dirichlet[edit]

Lets have topic = z, doc = d, word = w.

If we assume P(d|z) is Multinomial i.e.

$P(d|z)={n! \over z_{1}!\cdots z_{k}!}p_{1}^{z_{1}}\cdots p_{k}^{z_{k}},{where}\sum _{i=1}^{k}z_{i}=n$

What the above equation means is that in a document there are n (doc-length) slots to be filled by one of the z topics, ranging from $z_{1}$ to $z_{k}$ . P(d|z) is effectively the document distribution and is often referred as just P(d). This is different from p(d) which is the probability of d.

Assuming P(d|z) is a multinomial distribution what are our choices for P(z|d)

From Bayesian statistics :

P ( d | z ) * P ( z ) = P ( z | d ) * P ( d )

P ( z | d ) = P ( d | z ) * P ( z ) / P ( d )

= Multinomial () * P ( z ) / Constant(i.e. Evidence )

Since Dirichlet is a conjugate prior of Multinomial, it is a good choice for P ( z | d ). And since P ( z | d ) is dirichlet, so would be P ( z ) ( from the definition of Conjugate Prior).

CondtionalRandomField (CRF)[edit]

Nice Articles
- http://en.wikipedia.org/wiki/Hidden_Markov_model - Pretty good article on HMMs, understandign of which is extremely important to really understand CRF
- http://www.cs.huji.ac.il/course/2004/learns/ConditionalRandomFields2.ppt

Forward Backward Algorithm
- https://www.youtube.com/watch?v=7zDARfKVm7s

Comparison with HMM
- In HMM observations ( emission entities ) are independent
- Unlike CR which is based on discriminative model, HMM is based on a generative model, which means its based on joint probability distribution p(s,o) and requires a model of p(o)

Comparison with MEMM
- http://www3.cs.stonybrook.edu/~ychoi/cse628/lecture/hmm-memm-crf.pdf
- MEMM is a directed graph, means that hidden-states are dependent ONLY on past history
- Like CRF, MEMM is also based on discriminative model, which means its based on conditional probability distribution p(s|s',o) and doesn't require a model of p(o) as observation (o) is already given.

Hidden Markov Models[edit]

Essentially based on hidden variable model which can transition from 1 variable(state) to another with some probability. These states can be interpreted as some known set states which cannot be directly observed, like mood,affinity etc. This approach can be used to model supervised clustering, where clusters are known set of states and features being represented as transmission probabilities (also known as confusion matrix). In analogy with term-doc text mining parlance documents ( or sentences or browsing behaviors ) can be generated by sampling over terms in sequence. The hidden states could be abstract,intro,content,conclusion for documents ( subject,predicate,verb for sentences and start,browse,end for browsing users ).

An HMM is defined by λ = (M,N,A,B,π), where M=num_observations, N=num_hidden_states, A = NXN self-similarity matrix(state transition probabilities), B = MXN confusion matrix ( state->observable emission probabilities), π = NX1 array of initial hidden state probabilities.

Basic problems/solutions
- Training ==> Given a set of sequence observations O[i]s estimate HMM parameters by maximum likelihood ( i.e. maximize P(O|λ) ) == Baum Welch Algorithm
- Inferencing ==> Given an observable-sequence find out the most likely state-sequence == Viterbi Algo

LocalitySensitiveHashing (LSH)[edit]

CAVEAT :: These are all my personal observations. Appreciate constructive comments.

Interesting technique to do the following

Speeds up Clustering
- It avoids pairwise distance/similarity computation
Near Duplicate Detection
Tries to solve "Curse of Dimensionality" problem
- As query is independent of corpus size
More applicable to multimedia indexing
- Unlike text indexing, they follow QueryByExample hence LSH is directly applicable.
Less applicable to text indexing
- As text-queries are very short ( 1-2 words), hence there will be very few hits as very few documents will have 1-2 words.

Basic LSH Algorithm[edit]

Use functions of the form
- $g(p)=<h_{1}(p),h_{2}(p),\ldots h_{k}(p)>$

Preprocessing:
- Select $g_{1}...g_{L}$
- For all $p\in P$ , hash p to buckets $g_{1}(p)\cdots g_{L}(p)$

Query
- Retrieve the points from buckets $g_{1}(p),g_{2}(p)\cdots$ $g_{1}(p),g_{2}(p)\cdots$ until
  - Total points <= 2L or all
- Return top k-points by doing similarity computation among 2L points

LSH hash functions[edit]

Basic definition can be found in Locality_sensitive_hashing

Hamming Hash Function[edit]

This is the first LSH hash funchion mentioned in Indyk-Motwani'98's original paper on LSH. Seems to be more applicable to compute Image Similarity ( Color Histogram match etc )

Article link : Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality (1998) Indyk-Motwani
Metric : Hamming distance
$h(v)=v_{i}$ , i.e., the i-th bit of vector v in d dimensional Hamming space, where i=1 to d
$Pr[h(v)=h(u)]=1-{\frac {D(v,u)}{d}}$

Minwise Permutation[edit]

Initial intent was to do similarity computation using some probabilistic technique. Later used as an efficient LSH. Set based approach. All the other approaches are vector/space based. Since its set based unlike vector-space approach, it doesn't take care of the relative distribution of term-frequencies in a document. More applicable to do Clustering, Duplicate Detection etc. in text corpus.

Article link : Minwise Independent Permutations Broder et al.
Metric : Jaccard Coefficient
$h(d)=min_{w\in d}\{\pi (w)\}$
$Pr[h(v)=h(u)]=J(v,u)$ , where J = Jaccard Coefficient
Good for text/document similarity

p-Stable Distribution[edit]

Uses properties of stable distribution family. Seems to be the most talked about these days. Wide applications including text-retrieval, (sub)image-retrieval, clustering(text,media), duplicate-detection(text,media). Andoni-Indyk-06 extended the projection space from 1 dimension to t-dimension and introduced leech-lattice LSH based on Voronoi diagrammes and leech-lattice-space.

Good reading articles on p-stable based LSH
- LSH Scheme Based on p-Stable Distributions Indyk et al.
- Applying Hash-based Indexing in Text-based Information Retrieval. Benno Stein and Martin Potthast.

Metric : $l_{p}(v_{1},v_{2})$ Basically it honours $l_{p}^{d}$ norm
$h_{\mathbf {a} ,b}({\boldsymbol {\upsilon }})=\left\lfloor {\frac {\mathbf {a} \cdot \mathbf {v} +b}{r}}\right\rfloor$
$Pr[h(v)=h(u)]=\int _{0}^{r}{\frac {1}{c}}f_{p}(t/c)(1-t/r)\,dt$ $Pr[h(v)=h(u)]=\int _{0}^{r}{\frac {1}{c}}f_{p}(t/c)(1-t/r)\,dt$ , where
- $c=\left\Vert v-u\right\|_{p}$ , and
- $f_{p}(t)$ denote the prob density of the absolute value of the p-stable distribution

Semantic Hashing[edit]

This is based on stacked RBMs with layered pre training ( of RBM layers ) followed by backpropagation based training. This seem to work on input data vectors which are quasi binary and can be projected on a hamming space ( means coefficient-values close to 0 or 1 ). Thats one of the problem in modeling a problem with continuous probability distributions into RBM. Hence clustering based on containment or just presence of a click ( instead of click-count or click-probability ) can be modeled using RBMs.

Nice explanation + Python Code Echen's Implementation
Richard Weiss's explanation [1]
Article Link : Semantic Hashing
Code : Thank's to Benjamin Chung's jaRBM package

MapReduce Variants/Improvisations[edit]

My SVD Observations[edit]

The key to working with SVD of any rectangular matrix A is to consider AA^T and A^TA.
The columns of U, that is t by t, are eigenvectors of AA^T,
The columns of V, that is d by d, are eigenvectors of A^TA.
The singular values on the diagonal of S, that is t by d, are the positive square roots of the nonzero eigenvalues of both AA^T and A^TA.
Computation of EigenValues/Vectors
- First transform the given matrix into an upper-triangular form ( via HousehOlder reflections/Gram-Schmidt Normalization etc. )
- Then use iterative QR algorithm.

Random Indexing[edit]

Random Indexing is a vector space technique that provides an efficient and scalable approximation to distributional similarity problems.

Short tutorial on Random Indexing

 Instead of collecting word-use statistics into a 54,000 x 37,600
 (M x N) words-by-documents matrix F, collect them into a
 54,000 x 4,000 (M x N') matrix G.  How?
 0.  Set G = 0.
 1.  For each document d, make an N'-dimensional (e.g., 4,000-D)
     ternary INDEX VECTOR i_d by placing a small number k of
     -1s and +1s (e.g., ten of each; k = 10) at random among
     the 4,000; the rest will be 0s:
     i_d = 0000000+000...000+000000-000...000,
           that's 4,000 "trits," 3,980 of them 0s.
 2.  Scan through document d.  Every time the word w appears,
     add the index vector i_d to row w of matrix G.
 NOTE: The same procedure will produce the matrix F if the 37,600
    index vectors i_d are 37,600-bit and have only one 1, in the
    d-th position (i.e., unary or LOCAL representation):
    i_d = 00000000000...000+0000000000...000000000000000000000000000,
          that's 37,600 bits, 37,599 of them 0s.

EC2 Quirks[edit]

General Tricks[edit]

Logging into ec2 instance/cluster from a diff machine
- Run ec2-describe-instances and use the public DNS ( aka ec2-x-x-x-x.compute-1.amazonaws.com ) just after the <cluster>-master line.
- ssh -i ~/.ec2_keys/id_rsa-ec2-keypair root@ec2-x-x-x-x.compute-1.amazonaws.com
scp get
- scp -i ~/.ec2_keys/id_rsa-ec2-keypair root@ec2-x-x-x-x.compute-1.amazonaws.com:/root/log.txt .
Can use 'cluster ssh'
- Write a script that uses soap to get a list of all running instances and you can use cluster SSH to connect to and issue commands to all machines simultaneously

Use Oracle data mining[edit]

Summarized from [2]

Use Elasticfox and the following ami to boot instance : “ami-117b9778”
Enable Local Port-Forwarding
- Get ec2-host's local ip address ( using ifconfig ), then run the following ssh-command
- ssh -4 -g -L <local-port-12345>:<local-ip-X.X.X.X>:1521 -i <ec2-keypair-file> root@<ec2-XX-XX-XX-XX.compute-1.amazonaws.com>
To Use SQLPLUS
- ssh -i <ec2-keypair-file> root@<ec2-XX-XX-XX-XX.compute-1.amazonaws.com>
- su - oracle
- sqlplus dmuser/dmuser
To Use OracleDataMiner GUI
- Download Oracle's odminer-GUI tool for Data Mining
- <unzip-dir>/odminer/bin/odminer
- Use following odminer settings :
  - user:dmuser/pwd:dmuser
  - host:localhost/port:<local-port-12345>/sid:odmdb

Regression Analysis[edit]

Basic idea is based on OLS ( ordinary least squares fit ) of straight line. Extended to more than 1 dependent variables ( regressors ). Very applicable to inventory estimation/forecasting problem for any ad-network mobile networks specifically, since the inventories explode in mobile space because of device*carrier*geo aspects. For desktops device*carrier doesn't matter and geo almost remains the same for a particular inventory cell.

Linear Regression[edit]

Simple_linear_regression is the uni-variate form
Linear_regression is a multi-variate extension of the above
- Problems with multiple (linear) regression
  - Not local ==> two completely independent (orthogonal) regressors affecting each other

Scaling the Multiple (Linear) Regression Problem[edit]

Exploit hierarchy in regressing dimensions
Dance between being too macro and too micro
- Too micro ==> unstable regression
  - Exhibit random behavior ( central limit theorem )
- Too macro ==> bad estimation
- Violates locality ( as well collapse the dimensions )
Use LSH
- Control the dance via parameterization of hash functions

TV Advertising[edit]

Can mix contextual with behavioral, based on currently-watching-pgm and past history with appropriate weights given to each of the algorithms.
Can use user's EPG settings to pre-fetch ads for contextual ads, as we know what is the pgm user is gonna watch.
- Use CRF models as compared to Markovian models as we know the future behavior here.

Client based (STB) recommendation[edit]

Needs to cluster different user-groups in the household
- TV is shared, hence superimposition of behaviors. ( kids, family, teens, parental, senior citizen, young-adult etc. )
Identify current user-group ( or a mixture of groups )
- Based on temporal, contextual current data
- Based on current EPG settings ( user1,user2 etc. ), channels watching, frequency of activity, time-of-day etc.
Different user-groups has different cut-off bar for showing recommendations/advertisements :
- For Kids : low bar ==> ( low precision, high recall ). Can show it to anyone. may not be effective for some.
- For Adults : high bar ==> dangerous to show adult advertisements to kids ==>( Extremely High precision, high recall )
- For finance/home-loan ==> moderate bar ( as everyone is moderately interested in finance etc. Low returns when showing it to kids though )

Bidirectional HMM[edit]

Can model as bi-directional HMM
- STB sees the ads sent as observations
- Cable Network sees yes/no from STB as observations

OpenCV building guide[edit]

Go to OpenCV Installation Guide
- Download source code for linux/mac
- Install Cmake 2.8+
  - Download cmake source code
  - Building instructions : ( basically ./bootstrap; make ; make install )
  - Add /opt/cmake/bin/cmake in your $PATH ( thast where cmake is installed by default in linux ). In mac
- Install pkg-config
  - in linux : "yum install pkgconfig"
  - in Mac
    - Download source code
    - Instructions are in INSTALL fiel which comes with the src code ( steps : ./configure; make; make check; make install )
- Configure OpenCV by running cmake
  - cd opencvdir; mkdir release; cd release;
  - In redhat/linux : cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D BUILD_NEW_PYTHON_SUPPORT=ON ..
  - In Mac : cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D BUILD_NEW_PYTHON_SUPPORT=ON -D BUILD_JAVA_SUPPORT=FALSE ..
- sudo make install
For Python support
- I was able to compile successfully for python in Mac ( redhat has some issues I am still debugging )
- In mac ensure that you build with -D BUILD_NEW_PYTHON_SUPPORT=ON option while running cmake.
- after running "make install" , export PYTHONPATH=opencvdir/release/lib:opencvdir/samples/python:$PYTHONPATH ( there should be a cv2.so file under <opencvdir>/release/lib )
- now run python; import cv2;

Random Scratchpad[edit]

Installations on python
- Use pip ( via easy_install )
  - curl -O http://python-distribute.org/distribute_setup.py
  - python distribute_setup.py
  - sudo python distribute_setup.py
  - sudo easy_install -v pip
  - pip search libxml2; pip install lxml
  - sudo pip install sympy==0.7.1
- Use easy_install ( instructions in http://pypi.python.org/pypi/setuptools)
  - curl -O "http://pypi.python.org/packages/2.7/s/setuptools/setuptools-0.6c11-py2.7.egg#md5=fe1f997bc722265116870bc7919059ea"
  - mv setuptools-0.6c11-py2.7.egg#md5=fe1f997bc722265116870bc7919059ea setuptools-0.6c11-py2.7.egg
  - sh setuptools-0.6c11-py2.7.egg
  - sudo easy_install -v lxml
  - sudo easy_install -v simplejson
Passwordless SSH/SCP logins
- To summarize, a personal private(.ssh/id_rsa)/public(.ssh/id_rsa.pub) key pair is generated using the ssh-keygen command. The public key is then copied onto a remote systems' .ssh/authorized_keys file. And you can now SSH to the remote systems's account without the use of a password. SSH uses client's .ssh/id_rsa as the priavte key and sends auth request to server which then checks its authorized_keys entries. If present it allows pwdless login
- On Client machine :
  - ssh-keygen -t rsa
  - scp .ssh/id_rsa.pub prasen@remotemachine:/homes/prasen/.ssh
  - ssh prasen@remotemachine 'cat /homes/prasen/.ssh/id_rsa.pub >> /homes/prasen/.ssh/authorized_keys'
  - rm .ssh/id_rsa.pub ( Optional )

Mahout QuickstartGuide[edit]

svn co http://svn.apache.org/repos/asf/mahout/trunk
cd trunk; mvn -s /dev/null -P fastinstall ( to avoid your custom settings.xml )
Install and run hadoop in pdeudo-dist mode
Run Clustering of synthetic control data
- curl -o synthetic_control.data https://cwiki.apache.org/MAHOUT/clustering-of-synthetic-control-data.html;
- hadoop fs -mkdir testdata; hadoop fs -put <dir>/syncthetic_control.data testdata
- export HADOOP_CONF_DIR=~/work/hadoop/hadoop-0.20.203.0/conf;export HADOOP_HOME=~/work/hadoop/hadoop-0.20.203.0
- bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job
Step-By-Step-Tutorial links:

MySql on Mac[edit]

Very helpful tips
- https://trac.macports.org/wiki/howto/MAMP
  - sudo chown -R mysql:mysql /opt/local/var/db/mysql5/
  - sudo chown -R mysql:mysql /opt/local/var/run/mysql5/
  - sudo chown -R mysql:mysql /opt/local/var/log/mysql5/
- http://hintsforums.macworld.com/showthread.php?p=616669

GNUPLOT Tips[edit]

Use Gnuplot in Mac
- Install Macports ( macport installs packages in /opt/local/bin )
- sudo port install gnuplot ( gnuplot installed in /opt/local/bin/gnuplot )
- The default aquaterm doesnt work in Mac 10.6
  - xterm;gnuplot;set term xterm; plot 'dat' u 1:2 smooth cumulative; ( to see it on xterm )
  - set term png;set output 'output.png'; plot 'dat' u 1:2 smooth cumulative; ( this will generate png file )
Additional gnuplot tips
- set datafile separator '\t';
Ranked Data plotting
- plot [][26:1] "data" using 2:0:ytic(1) ( http://www.manning.com/janert/SampleCh13.pdf )
- plot [1:100] '/tmp/nonus_sr.txt' u 3:8, '/tmp/us_sr.txt' u 3:8, '/tmp/nonus_hvr.txt'u 3:8, '/tmp/us_hvr.txt' u 3:8 ;
Useful Links :
- http://t16web.lanl.gov/Kawano/gnuplot/datafile-e.html
- http://www.manning.com/janert/SampleCh13.pdf

Solr Tips[edit]

Simple Solr Setup
- cd $solr/example/solr
- ..\..\bin\solr start -f -V
Solr Setup in a new Directory
- cp -r $solr/example/solr new_solr/.
- rm -rf new_solr/solr/collection1/data/*
- <solr-install-dir>\bin\solr start -f -V -s <full-path>\new_solr\solr
Solr Multicore setup ( with sharding )
- cd $solr/example; mkdir -p my-multicore;
- create the folders in my-multicore: ./data/core0, ./data/core1, ./sharedInstanceDir
- cp -r multicore/core0/conf my-multicore/sharedInstanceDir
- create the solr.xml in my-multicore

 <cores adminPath="/admin/cores">
   <core name="core0" instanceDir="$solr/example/my-multicore/sharedInstanceDir" dataDir="$solr/example/my-multicore/data/core0"/>
   <core name="core1" instanceDir="$solr/example/my-multicore/sharedInstanceDir" dataDir="$solr/example/my-multicore/data/core1"/>
 </cores>

- java -Dsolr.solr.home=. -jar ../start.jar
- Search Link
  - ( simple OR search on a specific field ) curl 'http://localhost:8983/solr/select?rows=2&fl=id' --data-urlencode 'q={!q.op=OR df=features_txt} 6173380581647087986'
  - ( searching over all cores ) http://localhost:8983/solr/core0/select?shards=localhost:8983/solr/core0,localhost:8983/solr/core1&q=*:*
- Update Link ( separate for each core ) :
  - curl http://localhost:8983/solr/core0/update/csv?commit=true --data-binary @/tmp/tmp/tmp.txt.0.0 -H "Content-type:text/plain; charset=utf-8"
  - curl http://localhost:8983/solr/core1/update/csv?commit=true --data-binary @/tmp/tmp/tmp.txt.0.0 -H "Content-type:text/plain; charset=utf-8"
- Populating solr with a csv file
  - Make any csv file solr-compatible by adding id to the beginning of each line : nl csv.txt > solrcsv.txt ( It uses TAB as the default separator, use -s ',' for comma )
  - curl "http://localhost:8983/solr/update/csv?commit=true&fieldnames=id,sec_t,slk_t,pos_i,country_t,fct_t,wlpr_t,dac_t,count_l&separator=%09" -H "Content-type:text/plain" --data-binary @solrcsv.txt
  - wget -O - -v "http://localhost:8983/solr/update/csv?commit=true&fieldnames=id,name_txt,count_l,postcodes_t,bb_t,cg_t,imageryurl_t,zoomLevel_t,polygon_t&separator=%09" --header="Content-type:text/plain" --post-file=solr1.csv
  - Verify content by
    - http://localhost:8983/solr/select?q=*:*
    - wget -O - "http://localhost:8983/solr/select?q=name_txt:himalaya&wt=csv"
- Use of Dynamic fields
  - Its mentioned in <solr-home>/conf/schema.xml and you can use it as myfield1_s for string, myfield2_i for int etc.
  - Documentation in http://wiki.apache.org/solr/SchemaXml#Dynamic_fields Use it along with csvupdate. Very handy.
- Delete Link ( separate for each core, also different ways ) :
  - curl http://localhost:8983/solr/update -d 'stream.body=<delete><query>*:*</query></delete>&commit=true'
  - http://localhost:8983/solr/core0/update?stream.body=<delete><query>*:*</query></delete>&commit=true
  - http://localhost:8983/solr/core1/update?stream.body=<delete><query>*:*</query></delete>&commit=true

B-trees Vs LSM trees Vs Fractal trees ( LevelDB/tokutekDB/Cassandra)[edit]

Look at slides around 23. It merges LSM ideas with B-trees by having SStables/Message-buffers at each B-tree node. http://www.tokutek.com/wp-content/uploads/2012/01/20120109-boston-mysql-usergroup_posted.pdf
One of the cleanest explanation of how SSTables/LevelDBs merge ( compaction algo )
Writes/Adds are slow in B-tree, but fast with SSTables ( implementation of LSM tree )
Reads are slow with LSM as reads spread over different SSTables and are all linear ( no binary searches ). Fractal trees try to make reads fast. http://en.wikipedia.org/wiki/Fractional_cascading
Why are the writes of Cassandra executed faster than MySQL’s. InnoDB is using B+trees, while Cassandra is using LSM trees (BigTable is using LSM trees too).
http://nosql.mypopescu.com/post/3063887666/writes-performance-b-tree-lsm-tree-fractal-tree

Cassandra Gyaan[edit]

Git_SVN[edit]

Basic Commands[edit]

Git Pulls
- git pull ( to fetch from master )
- git pull origin <pmukherj_branch> ( to fetch from origin/remote of a different branch)
Git Branch
- git branch ( to see current branches, and also your current branch )
- git branch -a ( to see ALL remote and local branches )
- git checkout <pmukherj_scratch>
Git status
- git status

Python[edit]

Installation Tips[edit]

Use http proxy in python
- urllib2.urlopen('http://google.com', proxies={'http':'127.0.0.1'})
Printing stacktrace in Python
- import traceback
- try: <block>
- catch Excpetion, err:
  - traceback.print_exc(file=sys.stderr)
Pre requisites In WIndows
- Download and install easy_intsall
- Run 'C:\Python27\Scripts\easy_install pip' from commandline
- Install Mingw32 by downloading the executable
- Add C:\Mingw32\bin to PATH ( set PATH=%PATH%;C:\Mingw32\bin )
  - Adding M9ingw32 will fix lotsof python setup/installation problems..

Helpful Coding Tricks[edit]

Profiling in Python
- time python -m cProfile <your_python_code.py>
One liner tricks
- Print last 4 fields of a TSV file
  - python -c "import sys; tmp = lambda x: sys.stdout.write(x.split('\t')[:4]); map(tmp,sys.stdin)"

Python Data Sciences ToolKit[edit]

Install Tools
- set HTTP_PROXY=<host:port> ( if proxy exists )
- First approach
  - easy_install numpy
  - pip3 install numpy
- Alternative( easy ) approach, if the above doesnt work on windows
  - Download pre-compiled package numpy-1.10.4+mkl-cp27-cp27m-win32.whl
    - Choose the correct file compatible to your CPU and Python version ( e.g. numpy-1.9.2+mkl-cp33-none-win_amd64.whl for python 3.3 and 64 bit windows )
  - pip3 install <path-to-file-numpy-1.9.2+mkl-cp33-none-win_amd64.whl>
- Essenstial liraries to install
  - pandas ( pip3 install pandas )
  - numpy
  - scipy
  - scikit-learn
Install Theano/Keras
- Install Python2.7.11
- Install numpy ( for 32 bit )
- install scipy
- git clone https://github.com/Theano/Theano.git
- cd Theano
- python setup.py develop
- git clone https://github.com/fchollet/keras
- cd keras
- python setpu.py install
Theano/Keras Documentation
- http://speech.ee.ntu.edu.tw/~tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.pdf Shorter tutorial
- http://deeplearning.net/software/theano/theano.pdf 500+ page manual
SVM model generation from csv
- Python code

Naive Bayesian ( 2 packages )[edit]

Naive Bayesian scikit
- In Mac ( Because pip installs in /Library/Python/2.6/site-packages and native python uses /System/Library/Frameworks/Python.framework/Versions/2.6/Extras/lib/python]
  - sudo port install py26-scikits-learn
  - /opt/local/bin/python2.6
  - Document classifier code ( 100 features/words, 6 samples)
- In Redhat
  - pip install --upgrade numpy
  - pip install scipy
  - pip install scikit

Pig/Python Tricks[edit]

Using Python one liners as streams
- DEFINE GET_QUERY `python -c "import urlparse, sys; f1 = lambda l:urlparse.parse_qs(l)['r'][0]; f2=lambda u:urlparse.parse_qs(u.split('http://www.flickr.com/search/?')[1])['q'][0];[sys.stdout.write(f2(f1(l))) for l in sys.stdin]"`;
- r1 = stream r through GET_QUERY;
- Testing :
  - echo '/f?s=792600186&t=03bed0e95c1ba45ed897cbcc87f754af&r=http%3A%2F%2Fwww.flickr.com%2Fsearch%2F%3Fq%3Dsix%2Bflags%2Bmagic%2Bmountain%26d%3Dtaken-20040101-20051231%26ct%3D0%26mt%3Dall%26adv%3D1&fl_ev=0&lang=en&intl=us&sview=t&sscope=all&spage=9&sadv=t&sver=2.0' | python -c "import urlparse, sys; f1 = lambda l:urlparse.parse_qs(l)['r'][0];f2=lambda u:urlparse.parse_qs(u.split('http://www.flickr.com/search/?')[1])['q'][0];[sys.stdout.write(f2(f1(l))) for l in sys.stdin]"
Using Python files as streams ( more readable/can print stacktrace etc, but requires an additional file )
- DEFINE GET_QUERY `python test.py`; ( test.py is below)
  - import sys
  - for l in sys.stdin: print l
- r1 = stream r through GET_QUERY;
- Testing :
  - echo '/f?s=792600186&t=03bed0e95c1ba45ed897cbcc87f754af&r=http%3A%2F%2Fwww.flickr.com%2Fsearch%2F%3Fq%3Dsix%2Bflags%2Bmagic%2Bmountain%26d%3Dtaken-20040101-20051231%26ct%3D0%26mt%3Dall%26adv%3D1&fl_ev=0&lang=en&intl=us&sview=t&sscope=all&spage=9&sadv=t&sver=2.0' | python test.py

Merkle/Hash Trees, Gossip Protocol, Anti-Entropy/Read Repair, Hinted handoff[edit]

Anti-entropy, or replica synchronization, is the mechanism in Cassandra for ensuring that data on different nodes is updated to the newest version.
Here’s how it works. During a major compaction (see Compaction), the server initiates a TreeRequest/TreeResponse conversation to exchange Merkle trees with neighboring nodes. The Merkle tree is a hash representing the data in that column family. If the trees from the different nodes don’t match, then they have to be reconciled (or “repaired”) in order to determine the latest data values they should all be set to.
- In Dynamo, they use a Merkle tree for anti-entropy (see Merkle Tree). Cassandra does too, but the implementation is a little different. In Cassandra, each column family has its own Merkle tree; the tree is created as a snapshot during a major compaction operation, and it is kept only as long as is required to send it to the neighboring nodes on the ring. The advantage of this implementation is that it reduces disk I/O.
Cassandra uses Bloom filters to reduce disk access, which can be expensive, on key lookups. Every SSTable has an associated Bloom filter; when a query is performed, the Bloom filter is checked first before accessing disk. Bloom filters CANNOT have false negatives,
Gossip protocol is used in consistent hashing to slowly move repartitioned data from old shard to new shard ( combining read-repair techniques )
- http://www.inf.u-szeged.hu/~jelasity/talks/saso07tutorial.pdf

Excel Tricks[edit]

Create GUID in Excel
- =CONCATENATE(DEC2HEX(RANDBETWEEN(0,4294967295),8),"-",DEC2HEX(RANDBETWEEN(0,42949),4),"-",DEC2HEX(RANDBETWEEN(0,42949),4),"-",DEC2HEX(RANDBETWEEN(0,42949),4),"-",DEC2HEX(RANDBETWEEN(0,4294967295),8),DEC2HEX(RANDBETWEEN(0,42949),4))

Shell Tricks[edit]

Set JAVA_HOME in mac : export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/CurrentJDK/Home
for i in `ls *.jpg`; do /usr/bin/time -o t curl -s -o /dev/null "http://localhost:10080/getsimilartojpeg?numresults=1&filename=/home/prasen/SmartVision/data/books_image/$i"; done
time time curl "http://localhost:10080/getsimilartojpeg?numresults=1&filename=/home/prasen/SmartVision/data/books_image/1111111-M.jpg

r = load 'flickr_dump'; r = foreach r generate $0; r = limit r 1000; DEFINE GET_QUERY `python -c "import urlparse, sys; f1 = lambda l:urlparse.parse_qs(l)['r'][0];f2=lambda u:urlparse.parse_qs(u.split('http://www.flickr.com/search/?')[1])['q'][0];[sys.stdout.write(f2(f1(l))) for l in sys.stdin]"`; r1 = stream r through GET_QUERY; STORE r1 INTO '/tmp/save.txt';

Important References[edit]

My Understanding on Lucene Internals
Parallel Matrix Algorithms
Convolutional Neural Networks
- Good video series explanation of Convolutional Neural Network
- Standford's explanation

Spatial Data Mining[edit]

The Data Mining Trinity Tasks[edit]

classification
- for prediction, thematic classification
- spatial regression ( SAR )
- Markov Random field
clustering
- Identify hotspots, outliers
- DBSCAN, hierarchical,
association rules
- for relationship discover a --> b ==> X --> Y

Spatial Measures[edit]

Mean, Dispersion
Spatial Dependence. Moran's I A global measure of spatial autocorrelation.

Spatial Statistical Models[edit]

Point Process
Lattices
Geostatistics. To model spatial variability using variograms. Variograms summarizes the relationship between differences in pairs of measurements and the distance between corresponding pairs. Krging is another estimation procesdure used in geostatistics.

Data Science Interview Questions[edit]

Explain hypothesis testing and what is a p-value
What do you mean by a ML model. What exactly is learned and how. Explain with an example.
What is Maximum log likelihood and where it is used
Estimation of Pi using Monte Carlo SImulation
Probability of 3 ants colliding, starting from 3 corners of an equilateral triangle ( 2/8 )
Probability of an element present in a sample of size r out of a total s elements
- http://math.stackexchange.com/questions/489772/probability-of-sampling-with-and-without-replacement
K samples from an infinite stream
Interesting Algorithmic problems
- http://gregable.com/search/label/algorithms
https://medium.com/rants-on-machine-learning
https://www.mapr.com/blog/some-important-streaming-algorithms-you-should-know-about
- Hashing and Sketching
- Hyper log log = count distinct
- Count min = count(s)
- Streaming k-means
- Quantiles via t-digest