What can we learn from millions of lines of Groovy code on Github?

Github and Google recently announced and released the Github archive to BigQuery, liberating a huge dataset of source code in multiple programming languages, and making it easier to query it and discover some insights.

Github explained that the dataset comprises over 3 terabytes of data, covering 2.8 million repositories, 145 million commits, and over 2 billion file paths! The Google Cloud Platform blog gave some additional pointers about what’s possible to do with the querying capabilities of BigQuery. Also, you can have a look at the getting started guide, which lists the steps to follow to have fun with the dataset yourself.

My colleague Felipe gave some interesting stats about the top programming languages and licenses, while Francesc did some interesting analysis of Go repositories. So I was curious to investigate this dataset myself and run some queries about the Groovy programming language!

Without further ado, let’s dive in!

If you don’t already have a Google Cloud Platform account, you can sign up for the free trial, with $300 of credits to discover and have fun with all the products and services of the platform. Then, be sure to have a look at the Github dataset getting started guide I mentioned above, which gives you some ideas of things to try out, and the relevant steps to start tinkering with the data.

In the Google Cloud Platform console, I created an empty project (called “github-groovy-files” in my case) that will host the subset of the whole dataset focusing on the Groovy source files only.

Next, we can go to the Github public dataset on BigQuery:
https://bigquery.cloud.google.com/dataset/bigquery-public-data:github_repos

I created a new dataset called “github”, whose location is in the US (the default). Be sure to keep the default US location, as the Github dataset already lives in that region.

I launched the following query to list all the Groovy source files, and save them in a new table called “files” for further querying:

SELECT *
FROM [bigquery-public-data:github_repos.files]
WHERE RIGHT(path, 7) = '.groovy'

Now that I have my own subset of the dataset with only the Groovy files, I ran a count query to know the number of Groovy files available:

SELECT COUNT(*)
FROM [github-groovy-files:github.files]

There are 743,070 Groovy source files!

I was curious to see if there were some common names of Groovy scripts and classes that would appear more often than others:

SELECT TOP(filename, 24), COUNT(*) as n
FROM (
    SELECT LAST(SPLIT(path, '/')) as filename
    FROM [github.files] 
)

I was surprised to see A.groovy being the most frequent file name! I haven’t dived deeper yet, but I’d be curious to see what’s in those A.groovy files, as well as B.groovy or a.groovy in 4th and 13th positions respectively.
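If you want to dig into those later, a quick sketch like the following (which I haven’t run here, and which only matches paths containing a directory separator right before the file name) would list a few of the repositories hosting such A.groovy files:

-- list some repositories containing a file named A.groovy
SELECT repo_name, path
FROM [github.files]
WHERE RIGHT(path, 9) = '/A.groovy'
LIMIT 20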

Apache Groovy is often used for various automation tasks, and I found many Maven or Jenkins projects with scripts called verify.groovy that check that a certain task or job terminated correctly.

Files like BuildConfig.groovy, Config.groovy, UrlMappings.groovy, DataSource.groovy, BootStrap.groovy clearly come from the usual files found in Grails framework web applications.

You can also see configuration files like logback.groovy to configure the Logback logging library.

You don’t see the Gradle build automation tool showing up here, because I only selected files with a .groovy extension, not files with the .gradle extension. But we’ll come back to Gradle in a moment.
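In the meantime, a sibling of the very first query would count those .gradle build files in the full public files table (a sketch I haven’t run here, so no number to report):

-- count build scripts with a .gradle extension
SELECT COUNT(*) AS gradle_files
FROM [bigquery-public-data:github_repos.files]
WHERE RIGHT(path, 7) = '.gradle'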

So far, we’ve looked at the file names only, not at their content. That’s where we need another table, derived from the “contents” table of the dataset, which we’ll filter using the ids of the files we saved in our “files” table, with this query:

SELECT *
FROM [bigquery-public-data:github_repos.contents]
WHERE id IN (SELECT id FROM [github.files])

As this is a lot of content, I had to save the result of the query in a new table called “contents”, and I had to check the “allow large results” box in the options pane, which you can open with the “Show options” button below the query editor.

Across those 743,070 files, how many lines of Groovy code do you think there are? To find out, we need to split the raw content of the files into lines, as follows:

SELECT 
  COUNT(line) total_lines
FROM (
  SELECT SPLIT(content, '\n') AS line
  FROM [github-groovy-files:github.contents]
)

We have 16,464,376 lines of code across our 743,070 Groovy files. That’s an average of 22 lines per file, which is pretty low! It would be more interesting to draw a histogram to see the distribution of those lines of code. We can use quantiles to get a better idea of the distribution, with this query computing 10 quantiles:

SELECT QUANTILES(total_lines, 10) AS q
FROM (
  SELECT 
    COUNT(line) total_lines
  FROM (
    SELECT SPLIT(content, '\n') AS line, id
    FROM [github-groovy-files:github.contents]
  )
  GROUP BY id
)

Which gives this resulting table:

There are files with 0 lines of code! And the biggest one is 9506 lines long! 10% are 11 lines long or less, half are 37 lines or less, etc. And 10% are longer than 149 lines.
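If you’re curious which files those outliers are, a small sketch like the following (not run here) would list the ids of the ten longest files, which you could then look up in the “files” table to get their paths:

-- ten longest Groovy files by line count
SELECT id, COUNT(line) AS total_lines
FROM (
  SELECT SPLIT(content, '\n') AS line, id
  FROM [github-groovy-files:github.contents]
)
GROUP BY id
ORDER BY total_lines DESC
LIMIT 10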

Let’s now have a look at packages and imports for a change.

Do you know what the most frequently used packages are?

SELECT package, COUNT(*) count
FROM (
  SELECT REGEXP_EXTRACT(line, r' ([a-z0-9\._]*)\.') package, id
  FROM (
    SELECT SPLIT(content, '\n') line, id
    FROM [github-groovy-files:github.contents]
    WHERE content CONTAINS 'import'
    HAVING LEFT(line, 6) = 'import'
  )
  GROUP BY package, id
)
GROUP BY 1
ORDER BY count DESC
LIMIT 30;

The Spock and JUnit testing frameworks are the most widely used packages, showing that Groovy is used a lot for testing! We also see a lot of Grails and Gradle related packages, some logging, some Spring, Joda-Time, java.util.concurrent, servlets, etc.

We can zoom in on the groovy.* packages with:

SELECT package, COUNT(*) count
FROM (
  SELECT REGEXP_EXTRACT(line, r' ([a-z0-9\._]*)\.') package, id
  FROM (
    SELECT SPLIT(content, '\n') line, id
    FROM [github-groovy-files:github.contents]
    WHERE content CONTAINS 'import'
    HAVING LEFT(line, 6) = 'import'
  )
  GROUP BY package, id
)
WHERE package LIKE 'groovy.%'
GROUP BY 1
ORDER BY count DESC
LIMIT 10;

And ‘groovy.transform’ is unsurprisingly the winner, as it’s where all the Groovy AST transformations reside, providing useful code generation capabilities that save developers from writing tedious, repetitive code for common tasks (@Immutable, @Delegate, etc.). After transforms come ‘groovy.util.logging’ for logging, ‘groovy.json’ for working with JSON, ‘groovy.sql’ for interacting with databases through JDBC, ‘groovy.xml’ for parsing and producing XML payloads, and ‘groovy.text’ for templating engines:

With Groovy AST transformations being so prominent, we can also look at the most frequently used AST transformations with:

SELECT TOP(class_name, 10) class_name, COUNT(*) count
FROM (
  SELECT 
    REGEXP_EXTRACT(line, r' [a-z0-9\._]*\.([a-zA-Z0-9_]*)') class_name, 
    id
  FROM (
    SELECT SPLIT(content, '\n') line, id
    FROM [github-groovy-files:github.contents] 
    WHERE content CONTAINS 'import'
  )
  WHERE line LIKE '%groovy.transform.%'
  GROUP BY class_name, id
)
WHERE class_name != 'null'

And we get:

The @CompileStatic transformation is the king, followed by @ToString and @EqualsAndHashCode. @TypeChecked comes fourth, showing that Groovy’s static typing and static compilation support is really heavily used. Other interesting transformations follow, with @Canonical, @PackageScope, @InheritConstructors, @Immutable and @TupleConstructor.
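Note that the query above only looks at import statements. As a rough cross-check, a sketch like this one (not run here) would count the source lines where the @CompileStatic annotation literally appears, counting one per line rather than one per file:

-- lines mentioning @CompileStatic anywhere in the Groovy sources
SELECT COUNT(line) AS compile_static_lines
FROM (
  SELECT SPLIT(content, '\n') AS line
  FROM [github-groovy-files:github.contents]
)
WHERE line CONTAINS '@CompileStatic'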

As I was exploring imports, I also wondered whether aliased imports were frequently used or not:

SELECT aliased, count(aliased) total
FROM (
  SELECT 
    REGEXP_MATCH(line, r'.* (as) .*') aliased
  FROM (
    SELECT SPLIT(content, '\n') AS line
    FROM [github-groovy-files:github.contents]
  )
  WHERE line CONTAINS 'import '
)
GROUP BY aliased
LIMIT 100

Interestingly, there are 2,719 aliased imports versus 765,281 non-aliased ones; that’s about 0.36%, so roughly 1 “import … as …” for every 300 normal imports.
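To see which imports actually get aliased, a rough sketch like this one (not run here, and prone to a few false positives whenever ‘ as ’ shows up in an import line for other reasons) would list the most common aliased import lines:

-- most frequent 'import ... as ...' lines
SELECT line, COUNT(*) AS n
FROM (
  SELECT SPLIT(content, '\n') AS line
  FROM [github-groovy-files:github.contents]
)
WHERE line CONTAINS 'import ' AND line CONTAINS ' as '
GROUP BY line
ORDER BY n DESC
LIMIT 20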

And with that, my exploration of Groovy source files on Github comes to an end! It’s your turn to play with the dataset and see if there are interesting findings to be unveiled. Did you find anything?

Tale of a Groovy Spark in the Cloud

As I recently joined Google’s developer advocacy team for Google Cloud Platform, I thought I could have a bit of fun combining my passion for Apache Groovy with some cool cloudy stuff from Google! Incidentally, Paolo Di Tommaso tweeted about his own experiments with using Groovy with Apache Spark, and shared his code on Github:

I thought that would be a nice fun first little project to try to use Groovy to run a Spark job on Google Cloud Dataproc! Dataproc manages Hadoop & Spark for you: it’s a service that provides managed Apache Hadoop, Apache Spark, Apache Pig and Apache Hive. You can easily process big datasets at low cost, and control those costs by quickly creating managed clusters of any size and turning them off when you’re done. In addition, you can obviously use all the other Google Cloud Platform services and products from Dataproc (i.e. store the big datasets in Google Cloud Storage or on HDFS, query them through BigQuery, etc.)


More concretely, how do you run a Groovy job in Google Cloud Dataproc’s managed Spark service? Let’s see that in action!


To get started, I checked out Paolo’s samples from Github, and I even groovy-fied the Pi calculation example (based on this approach) to make it a bit more idiomatic:


package org.apache.spark.examples

import groovy.transform.CompileStatic
import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.api.java.function.Function
import org.apache.spark.api.java.function.Function2

@CompileStatic
final class GroovySparkPi {
    static void main(String[] args) throws Exception {
        def sparkConf = new SparkConf().setAppName("GroovySparkPi")
        def jsc = new JavaSparkContext(sparkConf)
        int slices = (args.length == 1) ? Integer.parseInt(args[0]) : 2
        int n = 100000 * slices

        // distribute the n sample indices over the cluster, in 'slices' partitions
        def dataSet = jsc.parallelize((0..<n).toList(), slices)

        // pick a random point in the unit square, counting 1 if it falls within the unit circle
        def mapper = {
            double x = Math.random() * 2 - 1
            double y = Math.random() * 2 - 1
            return (x * x + y * y < 1) ? 1 : 0
        }
        int count = dataSet
                .map(mapper as Function)
                .reduce({ int a, int b -> a + b } as Function2)

        println "Pi is roughly ${4.0 * count / n}"
        jsc.stop()
    }
}


You can also use a Groovy script instead of a full-blown class, but you need to make the script serializable with a little trick, by specifying a custom base script class. First, you need a custom serializable script base class:


abstract class SerializableScript extends Script implements Serializable {}

And in your job script, you specify that this is your base script class with:


import groovy.transform.BaseScript

@BaseScript SerializableScript baseScript

The project comes with a Gradle build file, so you can compile and build your project with the gradle jar command to quickly create a JAR archive.


Now let’s focus on the Cloud Dataproc part of the story! I basically followed the quickstart guide. I used the Console (the web UI), but you could use the gcloud command-line tool as well. You’ll need an account of course, and you’ll need to enable billing, as running Spark jobs on clusters can potentially be expensive, but don’t fear, there’s a free trial that you can take advantage of! You can also do a quick computation with the pricing calculator to estimate how much a certain workload will cost you. In my case, as a one-off job, it came to a sub-dollar bill.


Let’s create a brand new project:

We’re going to create a Spark cluster, but we’ll need to enable the Compute Engine API for this to work, so head over to the hamburger menu, select the API manager item, and enable it:



Select the Dataproc menu from the hamburger, which will allow you to create a brand new Spark cluster:

Create a cluster as follows (the smallest one possible for our demo):

Also, in case you have some heavy and expensive workloads for which it doesn’t matter much if they get interrupted (and relaunched later on), you could use Preemptible VMs to further lower the cost.


We created a JAR archive for our Groovy Spark demo, and we’ll push that JAR into Google Cloud Storage, so that we can create Spark jobs with it (but there are other ways to ship your job’s code as well). From the menu again, go to Cloud Storage, and create a new bucket:


Create a bucket with a name of your choice (we’ll need to remember it when creating the Spark jobs):


Once this bucket is created, click on it, and then click on the “upload files” button, to upload your JAR file:


We can come back to the Dataproc section, clicking on the Jobs sub-menu to create a new job:


We’ll create a new job, using our recently created cluster. We’ll need to specify the location of the JAR containing our Spark job: we’ll use the URL gs://groovy-spark-demo-jar/spark-groovy-1.1.jar. The gs:// part corresponds to the Google Cloud Storage protocol, as that’s where we’re hosting our JAR; groovy-spark-demo-jar/ is the name of the bucket we created; and the last part is the name of the JAR file itself. We’ll pass an argument of 1000 to specify the number of parallel computations of our Pi approximation algorithm we want to run:



Click “Submit”, and here we go, our Groovy Spark job is running in the cloud on our 2-node cluster!

Just a bit of setup through the console, which you can also do from the command line, and of course a bit of Groovy code to do the computation. Be sure to have a look at the quick start guide, which gives more details than this blog post, and check out some other Groovy Spark samples from Paolo in his Github project.

Joining Google as Developer Advocate for the Google Cloud Platform

The cat is out of the bag: I’m joining Google on June 6th, as a Developer Advocate for the Google Cloud Platform team!

My Groovy friends will likely remember when I launched Gaelyk, a lightweight toolkit for developing Groovy apps on Google App Engine. Since then, I’ve always been a big fan of the Google Cloud Platform (although it wasn’t called that back then) and followed the latest developments of the whole platform. And wow, so many new services and products have seen the light of day since my early experiments with App Engine! So there will be a lot to learn, a lot to do, and thus, a lot to advocate!

I’m really happy and excited to join the Google Cloud Platform team! I’m looking forward to working with my new team, and to joining some good old Googler friends I’ve come to know throughout my career.

I’ve been really pleased to work with my friends at Restlet for the past year and a half; I made new friends and learnt a lot along the way working with my colleagues. I wish them luck with their great Web API platform, and I’ll continue to follow their progress!

A Groovy journey in Open Source land (GR8Conf Europe)

Live from GR8Conf Europe 2016, in Copenhagen, Denmark! This morning, I presented my latest update on the history of Apache Groovy, and the latest developments in the 2.4.x and future 2.5 branches.

Abstract:

In dog years... err... Open Source years, the Groovy programming language project is a very mature and successful one, as its 12 million downloads a year can attest. The Groovy language is certainly the most widely deployed alternative language of the JVM today. But how do you go from a hobby nights & weekends project to a professionally company-sponsored one? And back again to hobby mode, while joining the wider Apache Software Foundation community?

Guillaume will guide you through the history of the project, its latest developments, and its recent news, outlining the importance of a community around an Open Source project.

We'll also discuss what it means to contribute, whether it's your hobby or you're a paid committer -- what does that change? What it means to join the Apache community, what the impact of professional Open Source is, and more.

And here are the slides:




Get in the flow! The API developer workflow!

What are the activities of the Web API developer? How should API tooling avoid getting in the way of developers' productivity? I presented a talk on this topic at the GlueCon conference:

The API ecosystem provides powerful tools, online services and definition formats for designing, testing, running, or managing APIs. All share common purposes: improve our productivity when developing an API, allow us to collaborate more effectively, or share our creations with the world!

But developers have already invented efficient tactics to streamline their development, gathered experience with and sharpened their tools of the trade. The result is that the services and formats mentioned above can actually get in their way too, and interrupt their development flow, as developers have to step out of their routine and processes to use them.

What can API tooling vendors do to reconcile the habits of developers with their tools? 
In this session, Guillaume Laforge, Restlet's Product Ninja & Advocate, will talk about building, versioning & dependency management of API artifacts, scenario & conformance testing, API documentation, continuous integration, multi-environment continuous deployment, and team collaboration! Let’s get back into the development flow!


 