Wednesday, December 5, 2012



As Mahout in Action is getting older, it is worth having a look at the link below:

http://sujitpal.blogspot.in/2012/09/learning-mahout-clustering.html



For calculating the Levenshtein (edit) distance between two words:

http://www.dotnetperls.com/levenshtein
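
To make the idea concrete, here is a minimal Python sketch of the Levenshtein distance. It is not the code from the linked page, just the standard dynamic-programming formulation; the function name `levenshtein` is my own choice.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b; insert, delete and substitute each cost 1."""
    # previous[j] = distance between the prefix of a seen so far and the first j chars of b
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i chars of a into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]
```

For example, `levenshtein("kitten", "sitting")` gives 3 (two substitutions and one insertion).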

Tuesday, November 27, 2012

Clustering Commands

Clustering:
cd mahout0.6



Sequence File Generation

bin/mahout seqdirectory -i /home/Textfiles/ -o /home/SequenceFiles/ -c UTF-8 -chunk 64

Term Vector Creation

bin/mahout seq2sparse -i /home/SequenceFiles/ -o /home/SequenceFiles-sparse --maxDFPercent 85 --namedVector --minDF 15


K-Means Clustering

bin/mahout kmeans -i /home/SequenceFiles-sparse/tfidf-vectors/ -c /home/kmeans-clusters -o /home/kmeans -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 10 -ow --clustering
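
To give a feeling for what this command computes, here is a rough Python sketch of k-means with a cosine distance. It is not Mahout's implementation: the names `cosine_distance` and `kmeans` are my own, the initial centroids are passed in by hand instead of being sampled, and the loop stops when the assignments no longer change (or after `max_iterations`, which plays the role of `-x`).

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity for two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Assumption: a zero vector is treated as maximally distant.
    return 1.0 if norm == 0 else 1.0 - dot / norm

def kmeans(points, initial_centroids, max_iterations=10):
    """Assign each point to its nearest centroid, then move centroids to the mean."""
    centroids = [list(c) for c in initial_centroids]  # copy, do not modify the input
    assignment = []
    for _ in range(max_iterations):
        new_assignment = [
            min(range(len(centroids)), key=lambda c: cosine_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clustering is stable
        assignment = new_assignment
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(column) / len(members) for column in zip(*members)]
    return assignment, centroids
```

Called with the points `[[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]` and initial centroids `[[1, 0], [0, 1]]`, the first two points land in one cluster and the last two in the other.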


ClusterDumper

bin/mahout clusterdump -s hdfs://<<host name>>:9000/home/kmeans/clusters-2-final/ -d hdfs://<<host name>>:9000/home/SequenceFiles-sparse/dictionary.file-0 -dt sequencefile -b 100 -n 100

Basics of Mahout

 
Mahout is essentially a collection of machine learning algorithms that address three major problems:
  1. Recommendation
  2. Clustering
  3. Classification
Recommendation:
Recommendation suggests items that match the user's taste, that is, items the user is likely to be interested in. It is based on the user's past activity (history). In general there are 3 types of recommendation:
  • User Based
  • Item Based
  • Content Based
User Based Recommendation:
Let's take book purchases on Amazon as a real-world example. When a user buys a book on Amazon, Amazon recommends a few more items that are similar to the user's taste.
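
A tiny Python sketch of the user-based idea: find users with similar purchase histories and suggest what they bought. This is not Mahout's recommender API; the function names, the sample data layout and the use of Jaccard overlap as the similarity are all my own assumptions for illustration.

```python
def jaccard(a, b):
    """Overlap between two sets of items, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, purchases, top_n=3):
    """purchases maps each user name to the set of items that user bought."""
    mine = purchases[user]
    scores = {}
    for other, items in purchases.items():
        if other == user:
            continue
        similarity = jaccard(mine, items)
        if similarity == 0:
            continue  # ignore users with nothing in common
        for item in items - mine:  # only items the user does not own yet
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties are broken alphabetically to keep the order stable.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]
```

With `{"alice": {"hadoop", "mahout"}, "bob": {"hadoop", "mahout", "lucene"}, "carol": {"cooking"}}`, the recommendation for alice is `["lucene"]`, because bob shares her taste and carol does not.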


Item Based Recommendation:
A real-world example is Facebook suggesting friends to you. If you look closely, the suggested friends are usually people the user already knows to some extent.


Clustering:
Clustering is the process of grouping text documents into sets of topically related documents. The clustering is done on TF-IDF vectors. Available algorithms include:
  • K-Means
  • Mean Shift
  • Fuzzy K-Means
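
Since the clustering works on TF-IDF vectors, here is a small Python sketch of the textbook TF-IDF weighting (term count times log of total documents over document frequency). Mahout's seq2sparse may use a somewhat different formula, so take this only as an illustration; the function name `tfidf` and the whitespace tokenizer are my own simplifications.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    total_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: count * math.log(total_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors
```

A term that occurs in every document gets weight 0, which is why very common words contribute little to the clustering.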

Classification:
Classification learns from existing categorized documents what the documents of a specific category look like, and is then able to assign unlabelled documents to the (hopefully) correct category.
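
As an illustration of "learning from categorized documents", here is a minimal multinomial naive Bayes sketch in Python with add-one smoothing. It is a toy version written for this post, not Mahout's classifier; `train`, `classify` and the sample labels are hypothetical names.

```python
import math
from collections import Counter, defaultdict

def train(labelled_docs):
    """labelled_docs is a list of (label, text) pairs."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    doc_counts = Counter()              # label -> number of training documents
    vocabulary = set()
    for label, text in labelled_docs:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        doc_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, doc_counts, vocabulary

def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        score = math.log(doc_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Trained on a few documents labelled "sports" and "tech", `classify("hadoop cluster job", model)` picks "tech" because "hadoop" and "cluster" were only seen in tech documents.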


All of the above methods are readily available in Mahout. Our main work is preparing the dataset properly so that the algorithms can produce good results.




Mahout Installation Guide


Mahout installation is pretty easy once you have confirmed that Hadoop is working fine. After that, setting up Mahout is a simple task.

Step 1: Check whether Hadoop is working fine.

Step 2: Check whether JAVA_HOME is set properly (echo $JAVA_HOME).

Step 3: Check whether HADOOP_HOME and HADOOP_CONF_DIR are set.
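
Steps 2 and 3 can also be checked with a few lines of Python instead of echoing each variable by hand. This is just a convenience sketch; the helper name `missing_variables` is my own, and it only checks that the variables are non-empty, not that the paths are correct.

```python
import os

REQUIRED = ("JAVA_HOME", "HADOOP_HOME", "HADOOP_CONF_DIR")

def missing_variables(environment, required=REQUIRED):
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not environment.get(name)]
```

Calling `missing_variables(os.environ)` returns an empty list when all three variables are set; anything it lists still needs to be exported before Mahout is started.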