Nerdy Sermons: Getting Started with Apache Spark: Find maximum commits by an author in a git log file

Monday, March 14, 2016

Getting Started with Apache Spark: Find maximum commits by an author in a git log file

- Install sbt (the Scala build tool).
- Install Apache Spark (download and unzip it).
- Go to the unzipped Apache Spark directory and run on the command line:
sbt assembly

(This takes a while; you may need to increase the memory allocated to the build in the config file.)

- Clone a git project, for example:
git clone https://github.com/apache/groovy

- From inside the cloned directory, save the log to a text file:
cd groovy
git log > C:\temp\log.txt

- Launch the Spark shell (spark-shell) and execute:

scala> val file = sc.textFile("C:\\temp\\log.txt")
file: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[21] at textFile at <console>:27

scala> val authorLines = file.filter(line => line.startsWith("Author: "))
authorLines: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[22] at filter at <console>:29

scala> val maxAuthorTuple = authorLines.countByValue().maxBy(_._2)
maxAuthorTuple: (String, Long) = (Author: Paul King <paulk@asert.com.au>,2991)

- Verify that maxAuthorTuple holds the author with the most commits on the checked-out branch, together with that author's commit count.
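
One way to cross-check the number is to repeat the same three steps (keep the Author lines, count identical lines, take the maximum) in plain Scala without Spark. The following is only a minimal sketch under stated assumptions: `TopAuthor` and `topAuthor` are hypothetical names made up for this illustration, the sample lines are invented, and the match on lines starting with "Author: " assumes the default `git log` layout, in which commit message lines are indented.

```scala
// Plain-Scala cross-check of the Spark result (no Spark dependency).
// TopAuthor / topAuthor are hypothetical names, not part of Spark or git.
object TopAuthor {

  // Returns the most frequent "Author: ..." line and how often it occurs,
  // or None when the input contains no author lines.
  def topAuthor(lines: Iterator[String]): Option[(String, Int)] = {
    val counts = lines
      .filter(_.startsWith("Author: "))   // header lines only; message lines are indented
      .toSeq
      .groupBy(identity)                  // group identical lines, like countByValue()
      .map { case (author, hits) => (author, hits.size) }
    if (counts.isEmpty) None else Some(counts.maxBy(_._2))
  }

  def main(args: Array[String]): Unit = {
    // Invented sample standing in for the real log file.
    val sample = Seq(
      "commit 1111111",
      "Author: Alice <alice@example.com>",
      "Date:   Mon Mar 14 10:00:00 2016 +0000",
      "",
      "    Fix build",
      "",
      "commit 2222222",
      "Author: Bob <bob@example.com>",
      "Date:   Mon Mar 14 09:00:00 2016 +0000",
      "",
      "    Author: this indented message line is ignored",
      "",
      "commit 3333333",
      "Author: Alice <alice@example.com>",
      "Date:   Sun Mar 13 18:00:00 2016 +0000",
      "",
      "    Add tests"
    )
    println(topAuthor(sample.iterator))
    // prints: Some((Author: Alice <alice@example.com>,2))
  }
}
```

To run it against the real log, pass `scala.io.Source.fromFile("C:\\temp\\log.txt").getLines()` to `topAuthor` instead of the sample; the count should agree with the one in maxAuthorTuple.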
