Saturday, December 22, 2012

Exploring Google Analytics data with Clojure, Incanter and MongoDB Part 2

Now that we have some helper functions, let's use them to import and process a Google Analytics CSV export file. In my last post we defined the initdb function. Please set path_to_dir according to your environment in util.clj, in the initdb definition.
 (prn (:out (clojure.java.shell/sh "/path_to_dir/mediana/00import2mongo.sh" "/path_to_dir/Analytics-All-Traffic-20121120-20121220.csv" "c5" "m1")))

desktop.clj

We evaluate functions interactively. Open mediana/src/mediana/core.clj in Emacs. M-x clojure-jack-in starts a SLIME session and a REPL. Pressing C-x C-e (or M-x slime-eval-last-expression) right after a closing parenthesis evaluates the preceding expression. A whole file can be evaluated with C-c C-k or M-x slime-compile-and-load-file. An active region can be evaluated with C-c C-r or M-x slime-eval-region.
(in-ns 'mediana.core)
;;:SourceMedium :PagesVisit :Visits :AvgVisitDuration :NewVisits :BounceRate                                                                                          
;; Desktop                                                                                                                                                                       
(defn nopfn []
 ;; Import csv into :m1 then process it into :m2                                                                                                                                
  (initdb)
  ;; Fetch "google / organic" data from :m1
  (fdb :m1 :where {:SourceMedium "google / organic" })
  ;; Result:
  ;; [:_id :SourceMedium :Visits :PagesVisit :AvgVisitDuration :NewVisits :BounceRate]
  ;; [# "google / organic" "25,104" 5.38 "00:05:32" "39.93%" "33.42%"]
  ;; What is the index of "google / organic" string in :SourceMedium row?                                                                                                        
  (iof "google / organic" (fdr :m1 :SourceMedium))
  ;; 352                                                                                                                                                                         
  ;; Get record from :m2                                                                                                                                                         
  (fdb :m2 :where {:SourceMedium 352})
  ;; Result:
  ;; [:_id :SourceMedium :Visits :PagesVisit :AvgVisitDuration :NewVisits :BounceRate]
  ;; [# 352 25104 5.38 332 0.3993 0.3342]
  ;; Processing works fine.  
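
The two results above show what the processing step has to do to each field: "25,104" becomes 25104, "00:05:32" becomes 332 seconds, and "39.93%" becomes 0.3993. A minimal sketch of those three conversions follows. The function names parse-count, parse-duration and parse-percent are hypothetical and are not taken from util.clj; the actual processing code may well be written differently.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helpers, for illustration only (not the functions from util.clj).

(defn parse-count
  "Strip thousands separators and parse an integer, e.g. \"25,104\"."
  [s]
  (Long/parseLong (str/replace s "," "")))

(defn parse-duration
  "Convert an \"HH:MM:SS\" string into a number of seconds."
  [s]
  (let [[h m sec] (map #(Long/parseLong %) (str/split s #":"))]
    (+ (* 3600 h) (* 60 m) sec)))

(defn parse-percent
  "Convert a percentage string such as \"39.93%\" into a fraction."
  [s]
  (/ (Double/parseDouble (str/replace s "%" "")) 100.0))

(parse-count "25,104")       ; => 25104
(parse-duration "00:05:32")  ; => 332
(parse-percent "39.93%")     ; approximately 0.3993
```
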
Now we have a working dataset in the :m2 collection. Let's look at some data.
;; What is the correlation between :PagesVisit and :AvgVisitDuration ?                                                                                                         
  (correlation (fdr :m2 :PagesVisit) (fdr :m2 :AvgVisitDuration))
  ;; 0.7215517848024166                                                                                                                                                          
  ;; It is a strong correlation. Let's see it in a chart with a linear-model:                                                                                                    
  (lm-chart (fdr :m2 :PagesVisit) (fdr :m2 :AvgVisitDuration) "Pages/Visit" "Avg Visit duration in sec")
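
To make the number above less of a black box, here is a sketch of the Pearson correlation coefficient computed by hand in plain Clojure, without Incanter. The names mean and pearson are this sketch's own; it is meant as an illustration of the formula, and Incanter's correlation should be preferred in practice.

```clojure
;; Illustration only: Pearson correlation of two equally long sequences.

(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn pearson
  "Covariance of xs and ys divided by the product of their standard
  deviations (the 1/n factors cancel, so plain sums are used)."
  [xs ys]
  (let [mx  (mean xs)
        my  (mean ys)
        dx  (map #(- % mx) xs)          ; deviations from the mean of xs
        dy  (map #(- % my) ys)          ; deviations from the mean of ys
        sxy (reduce + (map * dx dy))
        sxx (reduce + (map * dx dx))
        syy (reduce + (map * dy dy))]
    (/ sxy (Math/sqrt (* sxx syy)))))

;; A perfectly linear relationship gives a coefficient of 1.0:
(pearson [1 2 3 4] [2 4 6 8])
```

Applied to the two columns used above, (pearson (fdr :m2 :PagesVisit) (fdr :m2 :AvgVisitDuration)) would be expected to agree with Incanter's result up to floating-point rounding.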

Linear model details

  (def lm (linear-model (fdr :m2 :PagesVisit) (fdr :m2 :AvgVisitDuration)))
  (keys lm) ; see what fields are included                                                                                                                                       
  (:design-matrix lm) ; a matrix containing the independent variables and an intercept column
  (:coefs lm) ; regression coefficients                                                                                                                                          
  (:t-tests lm) ; t-test values of coefficients                                                                                                                                  
  (:t-probs lm) ; p-values for t-test values of coefficients                                                                                                                     
  (:fitted lm) ; the predicted values of y                                                                                                                                       
  (:residuals lm) ; the residuals of each observation                                                                                                                            
  (:std-errors lm) ; the standard errors of the coefficients
  (:sse lm) ; the sum of squared errors
  (:ssr lm) ; the regression sum of squares                                                                                                                                      
  (:sst lm) ; the total sum of squares                                                                                                                                           
  (:r-square lm) ; coefficient of determination 
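
The relationship between several of these fields is easier to see in a small hand-written version. The sketch below fits a line with one predictor using ordinary least squares and returns a map with a few of the same keys. The function simple-ols is hypothetical, and the [intercept slope] ordering of :coefs is this sketch's choice; it is not a reimplementation of Incanter's linear-model.

```clojure
;; Illustration only: ordinary least squares with a single predictor.

(defn simple-ols
  [xs ys]
  (let [n         (double (count xs))
        mx        (/ (reduce + xs) n)
        my        (/ (reduce + ys) n)
        sxy       (reduce + (map #(* (- %1 mx) (- %2 my)) xs ys))
        sxx       (reduce + (map #(* (- % mx) (- % mx)) xs))
        slope     (/ sxy sxx)
        intercept (- my (* slope mx))
        fitted    (map #(+ intercept (* slope %)) xs)
        ;; sum of squared errors: observed minus fitted
        sse       (reduce + (map #(let [e (- %1 %2)] (* e e)) ys fitted))
        ;; total sum of squares: observed minus mean
        sst       (reduce + (map #(let [d (- % my)] (* d d)) ys))
        ;; regression sum of squares, using sst = ssr + sse
        ssr       (- sst sse)]
    {:coefs    [intercept slope]
     :fitted   fitted
     :sse      sse
     :ssr      ssr
     :sst      sst
     :r-square (/ ssr sst)}))

;; Points that lie exactly on y = 1 + 2x:
(simple-ols [1 2 3 4] [3 5 7 9])
```

With a single predictor and an intercept, :r-square equals the square of the correlation coefficient, so the correlation of about 0.72 found earlier corresponds to an :r-square of roughly 0.52.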

Histogram

  (view (histogram (fdr :m2 :PagesVisit) :nbins 50))
  (view (histogram (fdr :m2 :BounceRate) :nbins 50))
  (view (histogram (fdr :m2 :NewVisits) :nbins 50))
  (view (histogram (fdr :m2 :AvgVisitDuration) :nbins 50))
  (view (histogram (fdr :m2 :Visits) :nbins 50))
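
For readers who want to inspect the numbers behind these charts, the following sketch counts values into equal-width bins. The function bin-counts is hypothetical, and the exact bin edges drawn by the charting library may differ from this simple scheme.

```clojure
;; Illustration only: counts per bin for nbins equal-width bins
;; spanning the minimum to the maximum of xs.

(defn bin-counts
  [xs nbins]
  (let [lo    (apply min xs)
        hi    (apply max xs)
        width (/ (- hi lo) (double nbins))
        ;; The maximum value is clamped into the last bin; if all values
        ;; are equal (width 0), everything goes into the first bin.
        idx   (fn [x]
                (if (zero? width)
                  0
                  (min (dec nbins) (int (/ (- x lo) width)))))]
    (reduce (fn [counts x] (update counts (idx x) inc))
            (vec (repeat nbins 0))
            xs)))

(bin-counts [0 1 2 3 4 5 6 7 8 9 10] 5)
;; => [2 2 2 2 3]
```

Something like (bin-counts (fdr :m2 :PagesVisit) 50) would give the raw counts corresponding to the first histogram, assuming fdr returns a sequence of numbers.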

Classifying Metrics

We use a simple classification for each metric: 0 for below average, 1 for average, and 2 for above average, as described in the classify-row function. We will take a look at it in my next post.
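
As a rough illustration of the idea only, the sketch below assigns 0, 1 or 2 to a value depending on where it falls relative to the mean of its metric. The names classify and classify-column are hypothetical, and so is the tolerance band that decides what counts as "average"; classify-row may define it differently.

```clojure
;; Illustration only: NOT the classify-row function from the project.
;; Assumption: a value counts as "average" when it lies within a relative
;; tolerance tol of the mean (e.g. 0.1 means within 10 percent).

(defn classify
  [avg tol x]
  (cond
    (< x (* avg (- 1.0 tol))) 0   ; below average
    (> x (* avg (+ 1.0 tol))) 2   ; above average
    :else                     1)) ; average

(defn classify-column
  "Classify every value of a metric against the mean of that metric."
  [tol xs]
  (let [avg (/ (reduce + xs) (double (count xs)))]
    (map #(classify avg tol %) xs)))

(classify-column 0.1 [80 100 120])
;; => (0 1 2)
```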
