Jump to content

User:Prasenjitmukherjee

From Wikipedia, the free encyclopedia

tPrasen Mukherjee

[edit]

I live in Hyderabad and work for Bing!. I am interested in data mining, mathematics, physics, religion and lots of other random topics.

Resume

[edit]

What do I do

[edit]
  • Clustering in Mobile Ad Network Space
    • Extended Fielder Method ( Based on SVD decomposition )
    • Stacked RBMs ( aka Restricted Boltzmann Machine )

Earlier Work

[edit]

Architecting Mapquest/Mobile Platform Solutions

[edit]
  • Mapquest
    • Single Line Geo Search
      • Extracting contextual location based on short term browser behavior
    • Mood based route recommendation
    • WebTileCache
  • Mobile AIM Platform
    • Wireless Village/IMPS

Spearheading R&D initiatives

[edit]
  • Leading R&D initiative by collaborating with premier research inistitutes
  • Leading open source Machine Learning initiatives (http://lucene.apache.org/mahout)

Ad monetization (Mobile/Social-Networking)

[edit]
  • Interact/Influence different business units to aid revenue-generation and monetization
  • Apply technology to gauge trends and user-behaviour
    • Spot market opportunities
[edit]

Recommendation System ( common for Mobile/Desktop)

[edit]
  • Recommend (news) articles
  • Recommend upcoming events
  • Personalized browsing - Idea here is to learn the immediate intent of user dynamically ( in the current session as he broswes more and more pages ) from his past history, and recommend articles based on that.

Temporal profiling of user log data

[edit]
  • Analyze distribution of personal topics ( probably extracted via PLSI,LDA etc.) over time
  • Use the above distribution data to group temporally similar users
  • Reference papers :
    • Andrew McCallum's Publications
    • A Continuous-Time Model of Topic Co-occurrence Trends. Xuerui Wang, Wei Li, and Andrew McCallum. AAAI Workshop on Event Detection, 2006. Capture the time distributions not only of a topics, but also of their co-occurrences. For example, notice that while NLP and ML have both been around for a long time, but their co-occurrence has been rising recently. The model is effectively a combination of the Pachinko Allocation Model (PAM) and Topics-Over-Time (TOT).
    • Topics over Time: A Non-Markov Continuous-Time Model of Topical Trends. Xuerui Wang and Andrew McCallum. Conference on Knowledge Discovery and Data Mining (KDD) 2006. A new LDA-style topic model that models trends over time. The meaning of a topic remains fixed and reliable, but its prevalence over time is captured, and topics may thus focus in on co-occurrence patterns that are time-sensitive. Unlike other work that relies on Markov assumptions or discretization of time, here each topic is associated with a continuous distribution over timestamps. Improvements in topic saliency and the ability to predict time given words.

Personalized Summarization

[edit]
  • Can be applied to mobile devices
  • Perform document summarization ( snippet-generation ) based on user profile.
  • Describe a user based on his past behavior. This requires NLP techniques.

Use Session data to aid unsupervised clustering

[edit]
  • Each session could be thought as a representative of a user-intent ( aka topic ) as generally ( there are always exceptions) we do related things in a session. Like searching for a travel destination, followed by searching for hotel rates, followed by searching for air-tickets. If we make this assumption, this greatly reduces the problem of high dimensions.
  • Now instead of clustering based on document/word tuple, we could create a representative vector for each user session ( call it a session-vector in term-space ) and apply clustering logic on session/word tuple. We can use LSH to see the distribution of session-vectors across different hash-buckets, which can be used to estimate the aprior number of clusters for latent variable algorithms like PLSI/LDA etc.

ML Algo resources

[edit]

Generic resources

[edit]

generative Vs discriminative

[edit]

Logistic Regression

[edit]

SVM

[edit]

Understanding the essence of probability distributions

[edit]

Understanding Maximum Likelihood

[edit]

Its the basis of probability theory, we use it everyday without realizing it. How do we compute probability : #-possible-outcomes/#-total-outcomes. This intuition/axiom/theorem/rule is based on max likelihood. Lets say we have a Bernoulli's event with n-trials and k-positive outcomes. We say probability of positive outcome is k/n. This comes from max-likelihood. Lets see it how.

  • Probability of data/likelihood ( L ) = p^k * (1-p)^(n-k)
  • Which value of p will maximize L. That is given by dL/dp = 0.
  • Solving dL/dp=0 will give same results as solving d(logL)/dp=0.
  • Solving d(logL)/dp=0 ==> p = k/n Q.E.D.

Bernoulli type distributions

[edit]
  • Relation between Bernoulli/Binomial/Beta
    • Binomial is a special case of Bernoulli.
    • When n(#-of-trials) = 1, the binomial distribution is a Bernoulli distribution.
    • Beta Distribution is continuous extension of Binomial Distribution
    • Beta Distribution is also the conjugate prior of Binomial and Bernoulli Distribution in Bayesian statistics.

Multivariate generalizations of above statements

[edit]

Other important relations between distributions

[edit]
  • The binomial distribution converges towards the Poisson distribution as the number of trials goes to infinity while the product np remains fixed.
  • The Gamma function generalizes the factorial function for non-integer and complex values of n. If n is a positive integer, then

showing the connection to the factorial function.

Why chose Dirichlet

[edit]

Lets have topic = z, doc = d, word = w.

If we assume P(d|z) is Multinomial i.e.

What the above equation means is that in a document there are n (doc-length) slots to be filled by one of the z topics, ranging from to . P(d|z) is effectively the document distribution and is often referred as just P(d). This is different from p(d) which is the probability of d.

Assuming P(d|z) is a multinomial distribution what are our choices for P(z|d)

From Bayesian statistics :
P ( d | z ) * P ( z ) = P ( z | d ) * P ( d )
P ( z | d ) = P ( d | z ) * P ( z ) / P ( d )
= Multinomial () * P ( z ) / Constant(i.e. Evidence )

Since Dirichlet is a conjugate prior of Multinomial, it is a good choice for P ( z | d ). And since P ( z | d ) is dirichlet, so would be P ( z ) ( from the definition of Conjugate Prior).

CondtionalRandomField (CRF)

[edit]
  • Comparison with HMM
    • In HMM observations ( emission entities ) are independent
    • Unlike CR which is based on discriminative model, HMM is based on a generative model, which means its based on joint probability distribution p(s,o) and requires a model of p(o)
  • Comparison with MEMM
    • http://www3.cs.stonybrook.edu/~ychoi/cse628/lecture/hmm-memm-crf.pdf
    • MEMM is a directed graph, means that hidden-states are dependent ONLY on past history
    • Like CRF, MEMM is also based on discriminative model, which means its based on conditional probability distribution p(s|s',o) and doesn't require a model of p(o) as observation (o) is already given.

Hidden Markov Models

[edit]

Essentially based on hidden variable model which can transition from 1 variable(state) to another with some probability. These states can be interpreted as some known set states which cannot be directly observed, like mood,affinity etc. This approach can be used to model supervised clustering, where clusters are known set of states and features being represented as transmission probabilities (also known as confusion matrix). In analogy with term-doc text mining parlance documents ( or sentences or browsing behaviors ) can be generated by sampling over terms in sequence. The hidden states could be abstract,intro,content,conclusion for documents ( subject,predicate,verb for sentences and start,browse,end for browsing users ).

  • An HMM is defined by λ = (M,N,A,B,π), where M=num_observations, N=num_hidden_states, A = NXN self-similarity matrix(state transition probabilities), B = MXN confusion matrix ( state->observable emission probabilities), π = NX1 array of initial hidden state probabilities.
  • Basic problems/solutions
    • Training ==> Given a set of sequence observations O[i]s estimate HMM parameters by maximum likelihood ( i.e. maximize P(O|λ) ) == Baum Welch Algorithm
    • Inferencing ==> Given an observable-sequence find out the most likely state-sequence == Viterbi Algo

LocalitySensitiveHashing (LSH)

[edit]

CAVEAT :: These are all my personal observations. Appreciate constructive comments.


Interesting technique to do the following

  • Speeds up Clustering
    • It avoids pairwise distance/similarity computation
  • Near Duplicate Detection
  • Tries to solve "Curse of Dimensionality" problem
    • As query is independent of corpus size
  • More applicable to multimedia indexing
    • Unlike text indexing, they follow QueryByExample hence LSH is directly applicable.
  • Less applicable to text indexing
    • As text-queries are very short ( 1-2 words), hence there will be very few hits as very few documents will have 1-2 words.

Basic LSH Algorithm

[edit]
  • Use functions of the form
  • Preprocessing:
    • Select
    • For all , hash p to buckets
  • Query
    • Retrieve the points from buckets until
      • Total points <= 2L or all
    • Return top k-points by doing similarity computation among 2L points


LSH hash functions

[edit]

Basic definition can be found in Locality_sensitive_hashing

Hamming Hash Function

[edit]

This is the first LSH hash funchion mentioned in Indyk-Motwani'98's original paper on LSH. Seems to be more applicable to compute Image Similarity ( Color Histogram match etc )

Minwise Permutation

[edit]

Initial intent was to do similarity computation using some probabilistic technique. Later used as an efficient LSH. Set based approach. All the other approaches are vector/space based. Since its set based unlike vector-space approach, it doesn't take care of the relative distribution of term-frequencies in a document. More applicable to do Clustering, Duplicate Detection etc. in text corpus.

  • Article link : Minwise Independent Permutations Broder et al.
  • Metric : Jaccard Coefficient
  • , where J = Jaccard Coefficient
  • Good for text/document similarity

p-Stable Distribution

[edit]

Uses properties of stable distribution family. Seems to be the most talked about these days. Wide applications including text-retrieval, (sub)image-retrieval, clustering(text,media), duplicate-detection(text,media). Andoni-Indyk-06 extended the projection space from 1 dimension to t-dimension and introduced leech-lattice LSH based on Voronoi diagrammes and leech-lattice-space.

  • Metric : Basically it honours norm
  • , where
    • , and
    • denote the prob density of the absolute value of the p-stable distribution

Semantic Hashing

[edit]

This is based on stacked RBMs with layered pre training ( of RBM layers ) followed by backpropagation based training. This seem to work on input data vectors which are quasi binary and can be projected on a hamming space ( means coefficient-values close to 0 or 1 ). Thats one of the problem in modeling a problem with continuous probability distributions into RBM. Hence clustering based on containment or just presence of a click ( instead of click-count or click-probability ) can be modeled using RBMs.

MapReduce Variants/Improvisations

[edit]

My SVD Observations

[edit]
  • The key to working with SVD of any rectangular matrix A is to consider AAT and ATA.
  • The columns of U, that is t by t, are eigenvectors of AAT,
  • The columns of V, that is d by d, are eigenvectors of ATA.
  • The singular values on the diagonal of S, that is t by d, are the positive square roots of the nonzero eigenvalues of both AAT and ATA.
  • Computation of EigenValues/Vectors
    • First transform the given matrix into an upper-triangular form ( via HousehOlder reflections/Gram-Schmidt Normalization etc. )
    • Then use iterative QR algorithm.

Random Indexing

[edit]

Random Indexing is a vector space technique that provides an efficient and scalable approximation to distributional similarity problems.

 Instead of collecting word-use statistics into a 54,000 x 37,600
 (M x N) words-by-documents matrix F, collect them into a
 54,000 x 4,000 (M x N') matrix G.  How?
 0.  Set G = 0.
 1.  For each document d, make an N'-dimensional (e.g., 4,000-D)
     ternary INDEX VECTOR i_d by placing a small number k of
     -1s and +1s (e.g., ten of each; k = 10) at random among
     the 4,000; the rest will be 0s:
     i_d = 0000000+000...000+000000-000...000,
           that's 4,000 "trits," 3,980 of them 0s.
 2.  Scan through document d.  Every time the word w appears,
     add the index vector i_d to row w of matrix G.
 NOTE: The same procedure will produce the matrix F if the 37,600
    index vectors i_d are 37,600-bit and have only one 1, in the
    d-th position (i.e., unary or LOCAL representation):
    i_d = 00000000000...000+0000000000...000000000000000000000000000,
          that's 37,600 bits, 37,599 of them 0s.

EC2 Quirks

[edit]

General Tricks

[edit]
  • Logging into ec2 instance/cluster from a diff machine
    • Run ec2-describe-instances and use the public DNS ( aka ec2-x-x-x-x.compute-1.amazonaws.com ) just after the <cluster>-master line.
    • ssh -i ~/.ec2_keys/id_rsa-ec2-keypair root@ec2-x-x-x-x.compute-1.amazonaws.com
  • scp get
    • scp -i ~/.ec2_keys/id_rsa-ec2-keypair root@ec2-x-x-x-x.compute-1.amazonaws.com:/root/log.txt .
  • Can use 'cluster ssh'
    • Write a script that uses soap to get a list of all running instances and you can use cluster SSH to connect to and issue commands to all machines simultaneously

Use Oracle data mining

[edit]

Summarized from [2]

  • Use Elasticfox and the following ami to boot instance : “ami-117b9778”
  • Enable Local Port-Forwarding
    • Get ec2-host's local ip address ( using ifconfig ), then run the following ssh-command
    • ssh -4 -g -L <local-port-12345>:<local-ip-X.X.X.X>:1521 -i <ec2-keypair-file> root@<ec2-XX-XX-XX-XX.compute-1.amazonaws.com>
  • To Use SQLPLUS
    • ssh -i <ec2-keypair-file> root@<ec2-XX-XX-XX-XX.compute-1.amazonaws.com>
    • su - oracle
    • sqlplus dmuser/dmuser
  • To Use OracleDataMiner GUI
    • Download Oracle's odminer-GUI tool for Data Mining
    • <unzip-dir>/odminer/bin/odminer
    • Use following odminer settings :
      • user:dmuser/pwd:dmuser
      • host:localhost/port:<local-port-12345>/sid:odmdb

Regression Analysis

[edit]

Basic idea is based on OLS ( ordinary least squares fit ) of straight line. Extended to more than 1 dependent variables ( regressors ). Very applicable to inventory estimation/forecasting problem for any ad-network mobile networks specifically, since the inventories explode in mobile space because of device*carrier*geo aspects. For desktops device*carrier doesn't matter and geo almost remains the same for a particular inventory cell.

Linear Regression

[edit]
  • Simple_linear_regression is the uni-variate form
  • Linear_regression is a multi-variate extension of the above
    • Problems with multiple (linear) regression
      • Not local ==> two completely independent (orthogonal) regressors affecting each other

Scaling the Multiple (Linear) Regression Problem

[edit]
  • Exploit hierarchy in regressing dimensions
  • Dance between being too macro and too micro
    • Too micro ==> unstable regression
      • Exhibit random behavior ( central limit theorem )
    • Too macro ==> bad estimation
    • Violates locality ( as well collapse the dimensions )
  • Use LSH
    • Control the dance via parameterization of hash functions

TV Advertising

[edit]
  • Can mix contextual with behavioral, based on currently-watching-pgm and past history with appropriate weights given to each of the algorithms.
  • Can use user's EPG settings to pre-fetch ads for contextual ads, as we know what is the pgm user is gonna watch.
    • Use CRF models as compared to Markovian models as we know the future behavior here.

Client based (STB) recommendation

[edit]
  • Needs to cluster different user-groups in the household
    • TV is shared, hence superimposition of behaviors. ( kids, family, teens, parental, senior citizen, young-adult etc. )
  • Identify current user-group ( or a mixture of groups )
    • Based on temporal, contextual current data
    • Based on current EPG settings ( user1,user2 etc. ), channels watching, frequency of activity, time-of-day etc.
  • Different user-groups has different cut-off bar for showing recommendations/advertisements :
    • For Kids  : low bar ==> ( low precision, high recall ). Can show it to anyone. may not be effective for some.
    • For Adults : high bar ==> dangerous to show adult advertisements to kids ==>( Extremely High precision, high recall )
    • For finance/home-loan ==> moderate bar ( as everyone is moderately interested in finance etc. Low returns when showing it to kids though )

Bidirectional HMM

[edit]
  • Can model as bi-directional HMM
    • STB sees the ads sent as observations
    • Cable Network sees yes/no from STB as observations

OpenCV building guide

[edit]
  • Go to OpenCV Installation Guide
    • Download source code for linux/mac
    • Install Cmake 2.8+
      • Download cmake source code
      • Building instructions : ( basically ./bootstrap; make ; make install )
      • Add /opt/cmake/bin/cmake in your $PATH ( thast where cmake is installed by default in linux ). In mac
    • Install pkg-config
      • in linux : "yum install pkgconfig"
      • in Mac
        • Download source code
        • Instructions are in INSTALL fiel which comes with the src code ( steps : ./configure; make; make check; make install )
    • Configure OpenCV by running cmake
      • cd opencvdir; mkdir release; cd release;
      • In redhat/linux : cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D BUILD_NEW_PYTHON_SUPPORT=ON ..
      • In Mac : cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D BUILD_NEW_PYTHON_SUPPORT=ON -D BUILD_JAVA_SUPPORT=FALSE ..
    • sudo make install
  • For Python support
    • I was able to compile successfully for python in Mac ( redhat has some issues I am still debugging )
    • In mac ensure that you build with -D BUILD_NEW_PYTHON_SUPPORT=ON option while running cmake.
    • after running "make install" , export PYTHONPATH=opencvdir/release/lib:opencvdir/samples/python:$PYTHONPATH ( there should be a cv2.so file under <opencvdir>/release/lib )
    • now run python; import cv2;

Random Scratchpad

[edit]
  • Installations on python
  • Passwordless SSH/SCP logins
    • To summarize, a personal private(.ssh/id_rsa)/public(.ssh/id_rsa.pub) key pair is generated using the ssh-keygen command. The public key is then copied onto a remote systems' .ssh/authorized_keys file. And you can now SSH to the remote systems's account without the use of a password. SSH uses client's .ssh/id_rsa as the priavte key and sends auth request to server which then checks its authorized_keys entries. If present it allows pwdless login
    • On Client machine :
      • ssh-keygen -t rsa
      • scp .ssh/id_rsa.pub prasen@remotemachine:/homes/prasen/.ssh
      • ssh prasen@remotemachine 'cat /homes/prasen/.ssh/id_rsa.pub >> /homes/prasen/.ssh/authorized_keys'
      • rm .ssh/id_rsa.pub ( Optional )

Mahout QuickstartGuide

[edit]

MySql on Mac

[edit]

GNUPLOT Tips

[edit]
  • Use Gnuplot in Mac
    • Install Macports ( macport installs packages in /opt/local/bin )
    • sudo port install gnuplot ( gnuplot installed in /opt/local/bin/gnuplot )
    • The default aquaterm doesnt work in Mac 10.6
      • xterm;gnuplot;set term xterm; plot 'dat' u 1:2 smooth cumulative; ( to see it on xterm )
      • set term png;set output 'output.png'; plot 'dat' u 1:2 smooth cumulative; ( this will generate png file )
  • Additional gnuplot tips
    • set datafile separator '\t';
  • Ranked Data plotting
  • Useful Links :

Solr Tips

[edit]
  • Simple Solr Setup
    • cd $solr/example/solr
    • ..\..\bin\solr start -f -V
  • Solr Setup in a new Directory
    • cp -r $solr/example/solr new_solr/.
    • rm -rf new_solr/solr/collection1/data/*
    • <solr-install-dir>\bin\solr start -f -V -s <full-path>\new_solr\solr
  • Solr Multicore setup ( with sharding )
    • cd $solr/example; mkdir -p my-multicore;
    • create the folders in my-multicore: ./data/core0, ./data/core1, ./sharedInstanceDir
    • cp -r multicore/core0/conf my-multicore/sharedInstanceDir
    • create the solr.xml in my-multicore
 <cores adminPath="/admin/cores">
   <core name="core0" instanceDir="$solr/example/my-multicore/sharedInstanceDir" dataDir="$solr/example/my-multicore/data/core0"/>
   <core name="core1" instanceDir="$solr/example/my-multicore/sharedInstanceDir" dataDir="$solr/example/my-multicore/data/core1"/>
 </cores>

B-trees Vs LSM trees Vs Fractal trees ( LevelDB/tokutekDB/Cassandra)

[edit]

Cassandra Gyaan

[edit]

Git_SVN

[edit]

Basic Commands

[edit]
  • Git Pulls
    • git pull ( to fetch from master )
    • git pull origin <pmukherj_branch> ( to fetch from origin/remote of a different branch)
  • Git Branch
    • git branch ( to see current branches, and also your current branch )
    • git branch -a ( to see ALL remote and local branches )
    • git checkout <pmukherj_scratch>
  • Git status
    • git status

Python

[edit]

Installation Tips

[edit]

Helpful Coding Tricks

[edit]
  • Profiling in Python
    • time python -m cProfile <your_python_code.py>
  • One liner tricks
    • Print last 4 fields of a TSV file
      • python -c "import sys; tmp = lambda x: sys.stdout.write(x.split('\t')[:4]); map(tmp,sys.stdin)"

Python Data Sciences ToolKit

[edit]

Naive Bayesian ( 2 packages )

[edit]

Pig/Python Tricks

[edit]
  • Using Python one liners as streams
    • DEFINE GET_QUERY `python -c "import urlparse, sys; f1 = lambda l:urlparse.parse_qs(l)['r'][0]; f2=lambda u:urlparse.parse_qs(u.split('http://www.flickr.com/search/?')[1])['q'][0];[sys.stdout.write(f2(f1(l))) for l in sys.stdin]"`;
    • r1 = stream r through GET_QUERY;
    • Testing :
      • echo '/f?s=792600186&t=03bed0e95c1ba45ed897cbcc87f754af&r=http%3A%2F%2Fwww.flickr.com%2Fsearch%2F%3Fq%3Dsix%2Bflags%2Bmagic%2Bmountain%26d%3Dtaken-20040101-20051231%26ct%3D0%26mt%3Dall%26adv%3D1&fl_ev=0&lang=en&intl=us&sview=t&sscope=all&spage=9&sadv=t&sver=2.0' | python -c "import urlparse, sys; f1 = lambda l:urlparse.parse_qs(l)['r'][0];f2=lambda u:urlparse.parse_qs(u.split('http://www.flickr.com/search/?')[1])['q'][0];[sys.stdout.write(f2(f1(l))) for l in sys.stdin]"
  • Using Python files as streams ( more readable/can print stacktrace etc, but requires an additional file )
    • DEFINE GET_QUERY `python test.py`; ( test.py is below)
      • import sys
      • for l in sys.stdin: print l
    • r1 = stream r through GET_QUERY;
    • Testing :
      • echo '/f?s=792600186&t=03bed0e95c1ba45ed897cbcc87f754af&r=http%3A%2F%2Fwww.flickr.com%2Fsearch%2F%3Fq%3Dsix%2Bflags%2Bmagic%2Bmountain%26d%3Dtaken-20040101-20051231%26ct%3D0%26mt%3Dall%26adv%3D1&fl_ev=0&lang=en&intl=us&sview=t&sscope=all&spage=9&sadv=t&sver=2.0' | python test.py

Merkle/Hash Trees, Gossip Protocol, Anti-Entropy/Read Repair, Hinted handoff

[edit]
  • Anti-entropy, or replica synchronization, is the mechanism in Cassandra for ensuring that data on different nodes is updated to the newest version.
  • Here’s how it works. During a major compaction (see Compaction), the server initiates a TreeRequest/TreeResponse conversation to exchange Merkle trees with neighboring nodes. The Merkle tree is a hash representing the data in that column family. If the trees from the different nodes don’t match, then they have to be reconciled (or “repaired”) in order to determine the latest data values they should all be set to.
    • In Dynamo, they use a Merkle tree for anti-entropy (see Merkle Tree). Cassandra does too, but the implementation is a little different. In Cassandra, each column family has its own Merkle tree; the tree is created as a snapshot during a major compaction operation, and it is kept only as long as is required to send it to the neighboring nodes on the ring. The advantage of this implementation is that it reduces disk I/O.
  • Cassandra uses Bloom filters to reduce disk access, which can be expensive, on key lookups. Every SSTable has an associated Bloom filter; when a query is performed, the Bloom filter is checked first before accessing disk. Bloom filters CANNOT have false negatives,
  • Gossip protocol is used in consistent hashing to slowly move repartitioned data from old shard to new shard ( combining read-repair techniques )

Excel Tricks

[edit]
  • Create GUID in Excel
    • =CONCATENATE(DEC2HEX(RANDBETWEEN(0,4294967295),8),"-",DEC2HEX(RANDBETWEEN(0,42949),4),"-",DEC2HEX(RANDBETWEEN(0,42949),4),"-",DEC2HEX(RANDBETWEEN(0,42949),4),"-",DEC2HEX(RANDBETWEEN(0,4294967295),8),DEC2HEX(RANDBETWEEN(0,42949),4))

Shell Tricks

[edit]

r = load 'flickr_dump'; r = foreach r generate $0; r = limit r 1000; DEFINE GET_QUERY `python -c "import urlparse, sys; f1 = lambda l:urlparse.parse_qs(l)['r'][0];f2=lambda u:urlparse.parse_qs(u.split('http://www.flickr.com/search/?')[1])['q'][0];[sys.stdout.write(f2(f1(l))) for l in sys.stdin]"`; r1 = stream r through GET_QUERY; STORE r1 INTO '/tmp/save.txt';

Important References

[edit]

Spatial Data Mining

[edit]

The Data Mining Trinity Tasks

[edit]
  • classification
    • for prediction, thematic classification
    • spatial regression ( SAR )
    • Markov Random field
  • clustering
    • Identify hotspots, outliers
    • DBSCAN, hierarchical,
  • association rules
    • for relationship discover a --> b ==> X --> Y

Spatial Measures

[edit]
  • Mean, Dispersion
  • Spatial Dependence. Moran's I A global measure of spatial autocorrelation.

Spatial Statistical Models

[edit]
  • Point Process
  • Lattices
  • Geostatistics. To model spatial variability using variograms. Variograms summarizes the relationship between differences in pairs of measurements and the distance between corresponding pairs. Krging is another estimation procesdure used in geostatistics.

Data Science Interview Questions

[edit]