Monday, March 14, 2016

Getting Started with Apache Spark: Find the author with the most commits in a git log file

- Install sbt (the Scala build tool).
- Install Apache Spark.
- Go to the unzipped Apache Spark directory and, on the command line, run
  sbt assembly

  (This takes a while. You may have to increase the memory allocated to the build in the config file.)

- Clone a Git project, for example
  git clone https://github.com/apache/groovy

- From inside the cloned directory, save the log to a text file
  git log > C:\temp\log.txt

- Launch the Spark shell (spark-shell) and execute:

scala> val file = sc.textFile("C:\\temp\\log.txt")
file: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[21] at textFile at <console>:27

scala> val authorLines = file.filter(line => line.startsWith("Author: "))
authorLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[22] at filter at <console>:29

scala> val maxAuthorTuple = authorLines.countByValue().maxBy(_._2)
maxAuthorTuple: (String, Long) = (Author: Paul King <paulk@asert.com.au>,2991)

- Verify that maxAuthorTuple holds the author with the most commits on the current branch, together with that author's commit count.
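
To see what the filter, countByValue and maxBy steps do, here is a minimal sketch of the same logic using plain Scala collections instead of an RDD, so it runs without Spark. The object name AuthorCounts, the topAuthors helper and the sample log lines are all made up for illustration and are not part of Spark or of the session above. The sketch assumes the default git log format, where each commit header contains a line starting with "Author: ".

```scala
// Plain-Scala sketch of the Spark steps above; no Spark dependency needed.
// All names and the sample data here are hypothetical.
object AuthorCounts {

  // Keep only the "Author: ..." header lines of a git log.
  // Indented commit-message lines that merely mention "Author" are skipped.
  def authorLines(log: Seq[String]): Seq[String] =
    log.filter(_.startsWith("Author: "))

  // Local stand-in for RDD.countByValue(): each distinct line with its count.
  def countByValue(lines: Seq[String]): Map[String, Long] =
    lines.groupBy(identity).map { case (line, group) => (line, group.size.toLong) }

  // Authors ordered by descending commit count, limited to the first n.
  def topAuthors(counts: Map[String, Long], n: Int): Seq[(String, Long)] =
    counts.toSeq.sortBy { case (_, count) => -count }.take(n)

  // Made-up sample standing in for the contents of log.txt.
  val sampleLog: Seq[String] = Seq(
    "commit 1111111",
    "Author: Alice Example <alice@example.com>",
    "Date:   Mon Mar 14 10:00:00 2016 +0000",
    "",
    "    Update Author list in README",
    "",
    "commit 2222222",
    "Author: Bob Example <bob@example.com>",
    "Date:   Sun Mar 13 09:00:00 2016 +0000",
    "",
    "    Fix build",
    "",
    "commit 3333333",
    "Author: Alice Example <alice@example.com>",
    "Date:   Sat Mar 12 08:00:00 2016 +0000",
    "",
    "    Initial commit"
  )

  def main(args: Array[String]): Unit = {
    val counts = countByValue(authorLines(sampleLog))
    // Same shape as maxAuthorTuple in the Spark shell: (author line, count).
    println(counts.maxBy(_._2)) // prints (Author: Alice Example <alice@example.com>,2)
    // Listing the top few authors is a quick sanity check on the maximum.
    topAuthors(counts, 2).foreach(println)
  }
}
```

Looking at the top few authors rather than only the maximum makes it easier to spot surprises, such as the same person appearing under two different email addresses and therefore being counted as two authors.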


