Web scraping and REST API calls on App Engine with Jsoup and groovy-wslite

After my Twitter sentiment article, I've spent the past couple of days playing again with the Cloud Natural Language API. This time, I wanted to make a little demo analyzing the text of speeches and remarks published by the press office of the White House. It's interesting to see how speeches alternate negative and positive sequences to reinforce the argument being made.

As usual, for my cloud demos, my weapons of choice for rapid development are Apache Groovy, with Glide & Gaelyk on Google App Engine! But for this demo, I needed two things:

  • scraping the text of the speeches and remarks from the White House website,
  • calling the Cloud Natural Language REST API to analyze their sentiment.

In both cases, we need to issue calls over the internet, and there are some limitations on App Engine with regard to such outbound networking. But if you use the plain Java HTTP / URL networking classes, you are fine: under the hood, they use App Engine's own URL Fetch service.
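
As a quick illustration, plain JDK networking like the following works fine from a Gaelyk controller and is routed through URL Fetch transparently (just a minimal sketch, fetching the speech page we'll scrape below):

// plain java.net.URL call: on App Engine, this goes through the URL Fetch service under the hood
def html = new URL('https://www.whitehouse.gov/the-press-office/2016/07/17/statement-president-shootings-baton-rouge-louisiana').text
println "Fetched ${html.length()} characters of HTML"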

I used Jsoup for the web scraping part, and it takes care of connecting to the website itself.

For interacting with the REST API, groovy-wslite came to my rescue, although I could have used the Java SDK like in my previous article.

Let's look at Jsoup and scraping first. In my controller fetching the content, I did something along those lines (you can run this script in the Groovy console):
import org.jsoup.*

def url = 'https://www.whitehouse.gov/the-press-office/2016/07/17/statement-president-shootings-baton-rouge-louisiana'

// connect with a browser-like user agent, then fetch and parse the document
def doc = Jsoup.connect(url)
               .userAgent('Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0')
               .get()

// select the paragraphs of the speech and join them back into plain text
println doc.select('.forall-body .field-item p').collect { it.text() }.join('\n\n')
Now I'm going to make a call to the NL API with groovy-wslite:
import wslite.rest.*

def apiKey = 'MY_TOP_SECRET_API_KEY'
def client = new RESTClient('https://language.googleapis.com/v1beta1/')

// 'text' holds the speech content scraped with Jsoup above
def result = client.post(path: 'documents:annotateText', query: [key: apiKey]) {
    type ContentType.JSON
    json document: [
            type   : 'PLAIN_TEXT',
            content: text
    ], features: [
            extractSyntax           : true,
            extractEntities         : true,
            extractDocumentSentiment: true
    ]
}

// returns the list of parsed sentences
println result.json.sentences.text.content

// prints the overall sentiment of the speech
println result.json.documentSentiment.polarity

Groovy-wslite nicely handles XML and JSON payloads: you can use Groovy maps for the input value, which are marshalled to JSON transparently, and the GPath notation to easily access the resulting JSON object returned by the API.

It was very quick and straightforward to use Jsoup and groovy-wslite for my web scraping and REST handling needs, and it was a breeze to integrate them in my App Engine application. In a follow-up article, I'll tell you a bit more about the sentiment analysis of the sentences of the speeches, so please stay tuned for the next installment!

Sentiment analysis on tweets

What’s the mood on Twitter today? Looking at my little twitter demo from a few weeks ago (using Glide & Gaelyk on Google App Engine), I thought I could enrich the visualization with some sentiment analysis to give more color to those tweets. Fortunately, there’s a new API in Google-town, the Cloud Natural Language API (some more info in the announcement and a great post showing textual analysis of Harry Potter and New York Times)!

The brand-new Cloud Natural Language API provides three key services:

  • Sentiment analysis: “inspects the given text and identifies the prevailing emotional opinion within the text, especially to determine a writer's attitude as positive, negative, or neutral”.

  • Entity recognition: “inspects the given text for known entities (proper nouns such as public figures, landmarks, etc.) and returns information about those entities”.

  • Syntax analysis: “extracts linguistic information, breaking up the given text into a series of sentences and tokens (generally, word boundaries), providing further analysis on those tokens”.

I’m going to focus only on the sentiment analysis in this article. When analyzing some text, the API tells you whether the content is negative, neutral or positive, returning “polarity” values ranging from -1 for negative to +1 for positive. And you also get a “magnitude”, from 0 to +Infinity to say how strong the emotions expressed are. You can read more about what polarity and magnitude mean for a more thorough understanding.

Let’s get started!

With the code base of my first article, I will add the sentiment analysis associated with the tweets I’m fetching. The idea is to come up with a colorful wall of tweets like this, with a range of colors from red for negative, to green for positive, through yellow for neutral:

I’ll create a new controller (mood.groovy) that will call the Cloud NL service, passing the text as input. I’ll take advantage of App Engine’s Memcache support to cache the calls to the service: since tweets are immutable, their sentiment won’t change. The controller will return a JSON structure holding the result of the sentiment analysis. From the index.gtpl view template, I’ll add a bit of JavaScript and AJAX to call my newly created controller.

Setting up the dependencies

You can either use the Cloud NL REST API or the Java SDK. I decided to use the latter, essentially just to benefit from code completion in my IDE. You can have a look at the Java samples provided. I’m updating the glide.gradle file to define my dependencies, including the google-api-services-language artifact which contains the Cloud NL service. I also needed to depend on the Google API client JARs, and Guava. Here’s what my Gradle dependencies ended up looking like:

dependencies {
   compile "com.google.api-client:google-api-client:1.21.0"
   compile "com.google.api-client:google-api-client-appengine:1.21.0"
   compile "com.google.api-client:google-api-client-servlet:1.21.0"
   compile "com.google.guava:guava:19.0"
   compile "com.google.apis:google-api-services-language:v1beta1-rev1-1.22.0"
   compile "org.twitter4j:twitter4j-appengine:4.0.4"
}

Creating a new route for the mood controller

First, let’s create a new route in _routes.groovy to point at the new controller:

post "/mood",       forward:  "/mood.groovy"

Coding the mood controller

Now let’s code the mood.groovy controller!

We’ll need quite a few imports for the Google API client classes, and a couple more for the Cloud Natural Language API:

import com.google.api.client.googleapis.json.GoogleJsonResponseException
import com.google.api.client.http.*
import com.google.api.client.googleapis.auth.oauth2.GoogleCredential
import com.google.api.client.googleapis.javanet.GoogleNetHttpTransport
import com.google.api.client.json.jackson2.JacksonFactory
import com.google.api.services.language.v1beta1.*
import com.google.api.services.language.v1beta1.model.*

We’re retrieving the text as a parameter, with the params map:

def text = params.txt

We’ve set up a few local variables that we’ll use for storing and returning the result of the sentiment analysis invocation:

def successOutcome = true
def reason = ""
def polarity = 0
def magnitude = 0

Let’s check if we have already got the sentiment analysis for the text parameter in Memcache:

def cachedResult = memcache[text]

If it’s in the cache, we’ll be able to return it, otherwise, it’s time to compute it:

if (!cachedResult) {
   try {
       // the sentiment analysis call will go here
   } catch (Throwable t) {
       successOutcome = false
       reason = t.message
   }
   // ... then we'll store the result in Memcache (see below)
}

We’re going to wrap our service call with a bit of exception handling: in case something goes wrong, we want to alert the user of what’s going on. And in lieu of the comment, we’ll now add the logic to analyze the sentiment.

We must define the Google credentials allowing us to access the API. Rather than explaining the whole process, please follow the authentication process explained in the documentation to create an API key and a service account:

        def credential = GoogleCredential.applicationDefault.createScoped(CloudNaturalLanguageAPIScopes.all())

Now we can create our Cloud Natural Language API caller:

        // the builder needs an HTTP transport, a JSON factory, and a request initializer
        // that applies our credential to each outgoing request
        def api = new CloudNaturalLanguageAPI.Builder(
                GoogleNetHttpTransport.newTrustedTransport(),
                JacksonFactory.defaultInstance,
                { HttpRequest request -> credential.initialize(request) } as HttpRequestInitializer
        ).build()

The caller requires some parameters like an HTTP transport, a JSON factory, and a request initializer that double checks that we’re allowed to make those API calls. Now that the API is set up, we can call it:

        def sentimentResponse = api.documents().analyzeSentiment(
                new AnalyzeSentimentRequest(document: new Document(content: text, type: "PLAIN_TEXT"))
        ).execute()

We created an AnalyzeSentimentRequest, passing a Document to analyze with the text of our tweets. Finally, we execute that request. With the values from the response, we’re going to assign our polarity and magnitude variables:

        polarity = sentimentResponse.documentSentiment.polarity
        magnitude = sentimentResponse.documentSentiment.magnitude

Then, we’re going to store the result (successful or not) in Memcache:

    cachedResult = [
            success  : successOutcome,
            message  : reason,
            polarity : polarity,
            magnitude: magnitude
    ]
    memcache[text] = cachedResult

Now, we set up the JSON content type for the answer, and we render the cachedResult map as a JSON object with the Groovy JSON builder available inside all controllers:

response.contentType = 'application/json'
json.result cachedResult

Calling our controller from the view

A bit of JavaScript & AJAX to the rescue to call the mood controller! I wanted something a bit lighter than jQuery, so I went with Zepto.js for fun. It’s pretty much the same API as jQuery anyway. Just before the end of the body, you can install Zepto from a CDN with:

<script src="https://cdnjs.cloudflare.com/ajax/libs/zepto/1.1.6/zepto.min.js"></script>

Then, we’ll open up our script tag for some coding:

<script language="javascript">
   Zepto(function(z) {
       // some magic here!
   });
</script>

As the sentiment analysis API call doesn’t support batch requests, we’ll have to call the API for each and every tweet. So let’s iterate over each tweet:

        z('.tweet').forEach(function(e, idx) {
           var txt = z(e).data('text');
           // ....

Compared to the previous article, I’ve added a data-text attribute containing the text of the tweet, stripped of hashtags, Twitter handles and links (I’ll let you use some regex magic to scrub those bits of text, as in the sketch below!).
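
Here’s one possible way to do that trimming server-side before filling the data-text attribute (just a sketch of the kind of regex I had in mind, not necessarily what the demo ends up using):

// strip links, @handles and #hashtags from the tweet text, then tidy up the whitespace
String trimForAnalysis(String tweet) {
    tweet.replaceAll(/(?i)https?:\/\/\S+/, '')   // links
         .replaceAll(/@\w+/, '')                 // Twitter handles
         .replaceAll(/#\w+/, '')                 // hashtags
         .replaceAll(/\s+/, ' ')                 // collapse whitespace
         .trim()
}

assert trimForAnalysis('Loving #Groovy with @glaforge https://t.co/abc123') == 'Loving with'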

Next, I call my mood controller, passing the trimmed text as input, and check if the response is successful:

            z.post('/mood', { txt: txt }, function(resp) {
               if (resp.result.success) {
                   // …

I retrieve the polarity and magnitude from the JSON payload returned by my mood controller:

                    var polarity = resp.result.polarity;
                    var magnitude = resp.result.magnitude;

Then I update the background color of my tweets with the following approach. I’m using the HSL color space: Hue, Saturation, Lightness.

The hue ranges from 0 to 360°, and for my tweets, I’m using the first third, from red / 0°, through yellow / 60°, up to green / 120° to represent the polarity, respectively with negative / -1, neutral / 0 and positive / +1.

The saturation (in percent) corresponds to the magnitude. For short texts like tweets, the magnitude rarely goes beyond 1, so I simply multiply the magnitude by 100 to get a percentage, and cap the result at 100% if it goes beyond.

For the lightness, I’ve got a fixed value of 80%, as 100% would always be full white!

Here’s a more explicit visualization of this color encoding with the following graph:

So what does the code look like, with the DOM updates done with Zepto?

                    var hsl = 'hsl(' +
                            Math.floor((polarity + 1) * 60) + ', ' +
                            Math.min(Math.floor(magnitude * 100), 100) + '%, ' +
                            '80%) !important';
                    z(e).css('background-color', hsl)
                        .data('polarity', polarity)
                        .data('magnitude', magnitude);

For fun, I’ve also added some smileys to represent five buckets of positivity / negativity (very negative, negative, neutral, positive, very positive), and from 0 to 3 exclamation marks for four buckets of magnitude. That’s what you see at the bottom of the tweet cards in the final screenshot:
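
The exact smileys and cut-off values don’t really matter; here’s a minimal sketch of that bucketing logic (in Groovy for readability, with thresholds made up for illustration):

// five polarity buckets (very negative ... very positive) and up to 3 exclamation marks for the magnitude
def moodSmiley = { double polarity ->
    def faces = [':-((', ':-(', ':-|', ':-)', ':-))']
    int bucket = Math.min((int) ((polarity + 1) / 2 * 5), 4)   // map [-1, +1] onto buckets 0..4
    faces[bucket]
}
def strengthMarks = { double magnitude ->
    '!' * Math.min((int) magnitude, 3)                         // 0 to 3 exclamation marks
}

assert moodSmiley(0.9) == ':-))'
assert strengthMarks(2.4) == '!!'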


And with that, we’re actually done! We have our controller fetching the tweets and forwarding to the view template from the last article, and we added a bit of JavaScript & AJAX to call our new mood controller, displaying some fancy colors to represent the mood of our tweets, using the brand new Cloud Natural Language API.

When playing with sentiment analysis, I generally agreed with the sentiment attributed to the tweets, but I was sometimes surprised by the outcome. For short bursts of text like tweets, it’s hard to decipher things like irony or sarcasm, and a particular tweet might appear positive when in reality it isn’t, and vice versa. Sentiment analysis is probably not an exact science, and you need more context to decide what’s really positive or negative.

Without even speaking of sarcasm or irony, certain tweets were sometimes deemed negative simply because usually negative words appeared in them: a “no” or “not” is not necessarily negative when it’s negating something already negative, turning it into something more positive (“it’s not uncool”). For longer texts, the overall sentiment seems more accurate, so perhaps sentiment analysis is more appropriate in such cases than on short snippets.

Getting started with Glide and Gaelyk on Google App Engine

Back in 2009, I created Gaelyk, a lightweight toolkit for developing Google App Engine apps using the Apache Groovy programming language. I even had the chance to speak at Google I/O 2009 about it! Good times, good times… Vladimír Oraný later joined me in maintaining and evolving Gaelyk, and Kunal Dabir created the fun Glide project, which is a thin wrapper around Gaelyk to further streamline the development of small to mid-sized apps for Google App Engine.

Today, I want to share with you a quick start guide to develop a little app that shows some tweets from selected accounts with the Twitter API (thanks to Twitter4J), using the Material Design Lite template for the look’n’feel (I used the “dashboard” template). I won’t list all the exact steps or all the precise changes made to the templates, but I want to give you the keys for having a productive experience with Glide and Gaelyk on App Engine. And here’s a screenshot of what we’ll be building:

Ready? Let’s start!

Installing Glide

In the Groovy community, most developers these days are using SDKMan to install SDKs for Groovy, Gradle, Grails, and more. Glide also comes in the form of an SDK, with a command-line, and is available via SDKMan. So, first step, let’s install SDKMan from your shell (there’s also a Windows-friendly version):

$ curl -s "https://get.sdkman.io" | bash

It will automatically install the SDK manager. Then, either you just open up a new terminal, or you run the following command, to have access to SDKMan in your current session:

$ source "$HOME/.sdkman/bin/sdkman-init.sh"

To check the installation succeeded, you can run it with the sdk command, for example by printing the current version of SDKMan:

$ sdk version

Now that SDKMan is installed, it’s time to install Glide as well:

$ sdk install glide

You can then check that glide is indeed functioning correctly by executing:

$ glide

If you’re on Windows or if you’re not planning to keep SDKMan around, you can also install Glide by other means, manually, as explained in the documentation.

Creating the skeleton of our application

Okay, we’re ready to create our first Glide / Gaelyk application! (You can also check out the Glide tutorial as well)

$ glide --app tweetapp create
$ cd tweetapp
$ glide run

Head over to your browser at http://localhost:8080/, and you’ll see a brilliant “hello glide” message showing up. So far so good, the app is running locally, transparently thanks to the App Engine SDK, now let’s tweak this skeleton!

The project structure is pretty simple: in the directory, you’ll see a glide.groovy file at the root, and an “app” sub-folder containing index.groovy and _routes.groovy:

  • glide.groovy — the configuration file for your app
  • index.groovy — the default controller
  • _routes.groovy — listing the mappings between URLs and controllers, and more

Configuring our application

In glide.groovy, you’ll have the app name and version name defined:

app {
    name = "tweetapp"    // the application name / app ID, which we'll update before deploying (see below)
    version = "1"
}

You might have to change the application name, as we shall see later on, when we deploy the application.

To use a more recent version of the App Engine SDK, you can append the following, to explicitly ask for a specific version of the SDK:

glide {
    versions {
        appengineVersion = "1.9.38"
    }
}

Defining library dependencies

At the root of our project, we’ll actually add a new configuration file: glide.gradle. This file will allow us to define library dependencies. It’s basically a fragment of a Gradle build configuration, where you can define those dependencies using the usual Gradle syntax. In our glide.gradle file, we’ll add the following dependency, for our Twitter integration:

dependencies {
    compile "org.twitter4j:twitter4j-appengine:4.0.4"
}

Using the Material Design Lite template

To make things pretty, we’ll be using the Material Design Lite dashboard sample, but feel free to skip this part if you want to go straight to the coding part! Download the ZIP archive. It comes with an index.html file, as well as a style.css stylesheet. We’ll copy both files to the app/ folder, but we’ll rename index.html into index.gtpl (to make it a Groovy template file).
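
Assuming the archive unpacks into a folder called mdl-template-dashboard next to our project (the folder name depends on the ZIP you downloaded), the copy boils down to something like:

$ cp ../mdl-template-dashboard/index.html app/index.gtpl
$ cp ../mdl-template-dashboard/style.css app/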

When you have a bigger project, with more assets, it’s obviously better to organize these views, stylesheets, JavaScript files, images, etc, in their own respective sub-folders. But for the purpose of my demo, I’ll keep everything in the same place.

You’ll see the template installed and visible if you go to this local URL:

http://localhost:8080/

I won’t detail all the changes to make to the template, and I’ll let you clean the template yourselves, but we can already remove everything that’s inside the inner div of the main tag: that’s where we’ll display our tweets!

Let's make pretty URLs!

We’d like to have some nice URLs for our app. For that, we’ll now have a look at the _routes.groovy file where you can define your URL mappings, to point at templates (*.gtpl files) or at controllers (*.groovy files, that can render some output directly or forward to templates for rich views). What shall we put in our routes definitions?

get "/",            redirect: "/u/glaforge"
get "/u/@who",      forward:  "/index.groovy?u=@who",
        validate: { request.who ==~ /[a-zA-Z0-9_]{1,15}/ },
        cache: 1.minute
get "/u/@who",      forward:  "/index.groovy?u=@who&error=invalid"

You can have a look at the Gaelyk documentation that defines the routes definition syntax for further explanations on what’s possible.

The root of the app, ‘/’, will redirect to /u/glaforge, to visualize my latest tweets. And all URLs like ‘/u/*’ will forward to our index.groovy controller, that will fetch the tweets for that Twitter user, and forward them to the index.gtpl view for rendering the result.

/u/glaforge → /index.groovy?u=glaforge → /index.gtpl

The routing syntax is using the @foo notation to denote query path variables, that we can then reuse in the forwarding part.

The routing rules are evaluated in order, and the first one that matches the URL will be chosen. We have two get "/u/@who" routes, in the first case, we have a validation rule that checks that the @who path variable is a valid Twitter handle (using Groovy’s regular expression matching operator). If the validation fails, this route isn’t chosen, and the chain will continue, and it will fall back to the following route that forwards to the template with an error query parameter.

Also interesting to note is the use of caching, with:

cache: 1.minute

The output of this URL will be put in App Engine’s Memcache, so that for the next minute, all requests to the same URL will be fetched from the cache, rather than having to call the controller and the Twitter API again, thus saving on computation and on third-party API call quota.

During development, you might want to comment out that caching configuration, as you do want to see the changes to the template or controller as you make them.

Time to code our tweet fetching controller

To use the Twitter API, you’ll have to register a new application on the Twitter Apps page. Twitter will give you the credentials you need to connect to the API. You’ll need the four following keys to configure the Twitter4J library:

  • the consumer API key
  • the secret consumer API key
  • the access token
  • and the secret access token

Let’s configure Twitter4J with that information. I’ll implement the “happy path” and will skip part of the proper error handling (an exercise for the reader?), to keep the code short for this article.

import twitter4j.*
import twitter4j.conf.*
def conf = new ConfigurationBuilder(
        debugEnabled: true,
        OAuthAccessToken: "CHANGE_ME",
        OAuthAccessTokenSecret: "CHANGE_ME",
        OAuthConsumerKey: "CHANGE_ME",
        OAuthConsumerSecret: "CHANGE_ME").build()
def twitter = new TwitterFactory(conf).instance

The API is configured with your credentials. Be sure to replace all the CHANGE_ME bits, obviously!

Let’s look up the Twitter handle coming through the query parameter, thanks to the ‘u’ attribute on the params map:

def accounts = twitter.lookupUsers(params.u)

There should only be two cases (that’s where there may be some more error handling to do!): either no user is found, or exactly one is. Let’s start with no user found:

if (accounts.isEmpty()) {
    request.errorMessage = "Account '${params.u}' doesn't exist."

If no user account was found, we’ll put an error message in the request that’ll be forwarded to our view template.

In the else branch, we’ll handle the normal case where the user was found:

} else {
    User userAccount = accounts[0]
    def tweets = twitter.search(new Query("from:${params.u}"))
            .tweets.findAll { !it.isRetweet() }

We get the first account returned, and issue a search request for the latest tweets from that account. We filter out the retweets to keep only the user’s original tweets (but it’s up to you if you want to keep them).

In the request for the view, we’ll add details about the account:

    request.account = [
            name  : userAccount.name,
            handle: userAccount.screenName,
            avatar: userAccount.biggerProfileImageURL
    ]

And we’ll also add the list of tweets:

    request.tweets = tweets.collect { Status s ->
        [id       : s.id,
         timestamp: s.createdAt.time,
         content  : s.text]
    }

And to finish our controller, we’ll forward to the view:

forward 'index.gtpl'

Now that our controller is ready, we’ll have to surface the data into the view template.

Modify the view template

Wherever the template displays the “Home” label, we’ll replace it with the Twitter handle. For that, we can use String interpolation in the template with the ${} notation. If there’s no error message, there should be an account, and we display that handle.

${ request.errorMessage ? 'Home' : '@' + request.account.handle }

Let’s display the list of tweets, or the error message if there’s one. We’ll iterate over the tweets from the request attributes, and add the following in the inner div of the main tag (for brevity’s sake, I'll remove the divs and CSS needed to make things pretty):

          <% if (request.tweets) {
               request.tweets.each { tweet -> %>
                 ${tweet.content}
          <%   }
             } else { %>
               ${request.errorMessage}
          <% } %>

And voila, our app is ready! Well, at least, it works locally on our app server, but it’s time to deploy it for real on App Engine!

Deploying to Google App Engine

Let’s log in to the Google Cloud Platform console to create our application project. If you don’t already have an account, you can benefit from the free trial which offers $300 of credits for the full platform.

Be sure to pay attention to the actual project ID that was created, as it may be slightly different from the project name itself. This project ID is also called the app ID, and that’s actually what you have to put in the glide.groovy file, in the app.name field (right, it’s a bit confusing, isn’t it?)

When the project is created, you’re able to use the glide command-line to deploy the application:

$ glide upload

If you see an error like below in the logs, it might mean that there’s a problem with your app ID, so be sure to double check it’s correct:

403 Forbidden
You do not have permission to modify this app (app_id=u's~my-tweet-demo').

Another occurrence of this error message is when you are using different accounts with Google Cloud Platform. For instance, in my case, I have both a personal gmail account for my personal apps, and a google.com account for my work-related apps. I had to zap ~/.appcfg_oauth2_tokens_java to let the upload logic use the correct account and ask me to authenticate again with OAuth2.

Once the upload has succeeded, you can access your app at its appspot.com URL.


Hooray, you’ve done it! :-)

What can we learn from millions of lines of Groovy code on Github?

Github and Google recently announced and released the Github archive to BigQuery, liberating a huge dataset of source code in multiple programming languages, and making it easier to query it and discover some insights.

Github explained that the dataset comprises over 3 terabytes of data, covering 2.8 million repositories, 145 million commits, and over 2 billion file paths! The Google Cloud Platform blog gave some additional pointers hinting at what’s possible with the querying capabilities of BigQuery. Also, you can have a look at the getting started guide with the steps to follow to have fun with the dataset yourself.

My colleague Felipe gave some interesting stats about the top programming languages and licenses, while Francesc did some interesting analysis of Go repositories. So I was curious to investigate this dataset myself and run some queries about the Groovy programming language!

Without further ado, let’s dive in!

If you don’t already have an account on Google Cloud Platform, you can get the free trial, with $300 of credits to discover and have fun with all the products and services of the platform. Then, be sure to have a look at the Github dataset getting started guide I mentioned above, which gives you some ideas of things to try out, and the relevant steps to start tinkering with the data.

In the Google Cloud Platform console, I’ve created an empty project (for me, called “github-groovy-files”) that will host the subset of the whole dataset, focusing on the Groovy source files only.

Next, we can go to the Github public dataset on BigQuery:

I created a new dataset called “github”, whose location is in the US (the default). Be sure to keep the default location in the US as the Github dataset is in that region already.

I launched the following query to list all the Groovy source files, and save them in a new table called “files” for further querying:

SELECT *
FROM [bigquery-public-data:github_repos.files]
WHERE RIGHT(path, 7) = '.groovy'

Now that I have my own subset of the dataset with only the Groovy files, I ran a count query to know the number of Groovy files available:

SELECT COUNT(*)
FROM [github-groovy-files:github.files]

There are 743,070 Groovy source files!

I was curious to see if there were some common names of Groovy scripts and classes that would appear more often than others:

SELECT TOP(filename, 24), COUNT(*) as n
FROM (
    SELECT LAST(SPLIT(path, '/')) as filename
    FROM [github.files]
)

I was surprised to see A.groovy being the most frequent file name! I haven’t dived deeper yet, but I’d be curious to see what’s in those A.groovy files, as well as B.groovy or a.groovy in 4th and 13th positions respectively.

Apache Groovy is often used for various automation tasks, and I’ve found many Maven or Jenkins scripts to check that a certain task or job terminated correctly thanks to scripts called verify.groovy.

Files like BuildConfig.groovy, Config.groovy, UrlMappings.groovy, DataSource.groovy, BootStrap.groovy clearly come from the usual files found in Grails framework web applications.

You can also see configuration files like logback.groovy to configure the Logback logging library.

You don’t see usage of the Gradle build automation tool here, because I only selected files with a .groovy extension, and not files with the .gradle extension. But we’ll come back to Gradle in a moment.

So far, we’ve looked at the file names only, not at their content. That’s where we need another table, derived from the “contents” table of the dataset, which we’ll filter using the file IDs we saved in our “files” table, with this query:

SELECT *
FROM [bigquery-public-data:github_repos.contents]
WHERE id IN (SELECT id FROM [github.files])

As this is a lot of content, I had to save the result of the query in a new table called “contents”, and I had to check the box “allow large results” in the options pane that you can open thanks to the “Show options” button below the query editor.

Across those 743,070 files, how many lines of Groovy code do you think there are? For that purpose, we need to split the raw content of the files into lines, as follows:

SELECT
  COUNT(line) total_lines
FROM (
  SELECT SPLIT(content, '\n') AS line
  FROM [github-groovy-files:github.contents]
)

We have 16,464,376 lines of code across our 743,070 Groovy files. That’s an average of 22 lines per file, which is pretty low! It would be more interesting to draw a histogram to see the distribution of those lines of code. We can use quantiles to get a better idea of the distribution, with this query using 10 quantiles:

SELECT QUANTILES(total_lines, 10) AS q
FROM (
  SELECT
    COUNT(line) total_lines
  FROM (
    SELECT SPLIT(content, '\n') AS line, id
    FROM [github-groovy-files:github.contents]
  )
  GROUP BY id
)

Which gives this resulting table:

There are files with 0 lines of code! And the biggest one is 9506 lines long! 10% are 11 lines long or less, half are 37 lines or less, etc. And 10% are longer than 149 lines.

Let’s now have a look at packages and imports for a change.

Do you know what are the most frequent packages used?

SELECT package, COUNT(*) count
FROM (
  SELECT REGEXP_EXTRACT(line, r' ([a-z0-9\._]*)\.') package, id
  FROM (
    SELECT SPLIT(content, '\n') line, id
    FROM [github-groovy-files:github.contents]
    WHERE content CONTAINS 'import'
    HAVING LEFT(line, 6) = 'import' )
  GROUP BY package, id )
GROUP BY package
ORDER BY count DESC

The Spock and JUnit testing frameworks are the most widely used packages, showing that Groovy is used a lot for testing! We also see a lot of Grails and Gradle related packages, and some logging, some Spring, Joda-Time, Java util-concurrent or servlets, etc.

We can zoom in the groovy.* packages with:

SELECT package, COUNT(*) count
FROM (
  SELECT REGEXP_EXTRACT(line, r' ([a-z0-9\._]*)\.') package, id
  FROM (
    SELECT SPLIT(content, '\n') line, id
    FROM [github-groovy-files:github.contents]
    WHERE content CONTAINS 'import'
    HAVING LEFT(line, 6) = 'import' )
  GROUP BY package, id )
WHERE package LIKE 'groovy.%'
GROUP BY package
ORDER BY count DESC

And ‘groovy.transform’ is unsurprisingly the winner, as it’s where all Groovy AST transformations reside, providing useful code generation capabilities that save developers from writing tedious repetitive code for common tasks (@Immutable, @Delegate, etc.). After transforms come ‘groovy.util.logging’ for logging, ‘groovy.json’ for working with JSON, ‘groovy.sql’ for interacting with databases through JDBC, ‘groovy.xml’ to parse and produce XML payloads, and ‘groovy.text’ for templating engines:

With Groovy AST transformations being so prominent, we can also look at the most frequently used AST transformations with:

SELECT TOP(class_name, 10) class_name, COUNT(*) count
FROM (
  SELECT
    REGEXP_EXTRACT(line, r' [a-z0-9\._]*\.([a-zA-Z0-9_]*)') class_name,
    id
  FROM (
    SELECT SPLIT(content, '\n') line, id
    FROM [github-groovy-files:github.contents]
    WHERE content CONTAINS 'import' )
  WHERE line LIKE '%groovy.transform.%'
  GROUP BY class_name, id )
WHERE class_name != 'null'

And we get:

The @CompileStatic transformation is the king! Followed by @ToString and @EqualsAndHashCode. But then @TypeChecked is fourth, showing that the static typing and compilation support of Groovy is really heavily used. Other interesting transforms used follow with @Canonical, @PackageScope, @InheritConstructors, @Immutable or @TupleConstructor.
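
As a reminder of why those transformations are so popular, here’s the kind of boilerplate they remove; a small illustrative snippet, not taken from the dataset:

import groovy.transform.CompileStatic
import groovy.transform.EqualsAndHashCode
import groovy.transform.ToString

// statically compiled, with toString(), equals() and hashCode() generated at compile time
@CompileStatic
@ToString
@EqualsAndHashCode
class Speaker {
    String name
    String twitterHandle
}

assert new Speaker(name: 'Guillaume', twitterHandle: 'glaforge').toString() == 'Speaker(Guillaume, glaforge)'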

As I was exploring imports, I also wondered whether aliased imports were often used or not:

SELECT aliased, COUNT(aliased) total
FROM (
  SELECT
    REGEXP_MATCH(line, r'.* (as) .*') aliased
  FROM (
    SELECT SPLIT(content, '\n') AS line
    FROM [github-groovy-files:github.contents] )
  WHERE line CONTAINS 'import ' )
GROUP BY aliased

Interestingly, there are 2719 aliased imports versus 765281 non-aliased ones, which is about 0.36%, so roughly 1 “import … as …” for every 300 normal imports.

And with that, this wraps up my exploration of Groovy source files on Github! It’s your turn to play with the dataset and see if there are interesting findings to be unveiled. Did you find anything?

Tale of a Groovy Spark in the Cloud

As I recently joined Google’s developer advocacy team for Google Cloud Platform, I thought I could have a little bit of fun combining my passion for Apache Groovy with some cool cloudy stuff from Google! Incidentally, Paolo Di Tommaso tweeted about his own experiments with using Groovy with Apache Spark, and shared his code on Github:

I thought that would be a nice fun first little project to try to use Groovy to run a Spark job on Google Cloud Dataproc! Dataproc manages Hadoop & Spark for you: it’s a service that provides managed Apache Hadoop, Apache Spark, Apache Pig and Apache Hive. You can easily process big datasets at low cost, and control those costs by quickly creating managed clusters of any size and turning them off when you’re done. In addition, you can obviously use all the other Google Cloud Platform services and products from Dataproc (i.e. store the big datasets in Google Cloud Storage, on HDFS, through BigQuery, etc.)

More concretely, how do you run a Groovy job on Google Cloud Dataproc’s managed Spark service? Let’s see that in action!

To get started, I checked out Paolo’s samples from Github, and I even groovy-fied the Pi calculation example (based on this approach) to make it a bit more idiomatic:

package org.apache.spark.examples

import groovy.transform.CompileStatic
import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.api.java.function.Function
import org.apache.spark.api.java.function.Function2

@CompileStatic
final class GroovySparkPi {
  static void main(String[] args) throws Exception {
    def sparkConf = new SparkConf().setAppName("GroovySparkPi")
    def jsc = new JavaSparkContext(sparkConf)
    int slices = (args.length == 1) ? Integer.parseInt(args[0]) : 2
    int n = 100000 * slices

    // distribute the n samples across the cluster
    def dataSet = jsc.parallelize(0..<n)

    // Monte Carlo sampling: does a random point in the square fall within the unit circle?
    def mapper = {
      double x = Math.random() * 2 - 1
      double y = Math.random() * 2 - 1
      return (x * x + y * y < 1) ? 1 : 0
    }

    int count = dataSet
            .map(mapper as Function)
            .reduce({ int a, int b -> a + b } as Function2)

    println "Pi is roughly ${4.0 * count / n}"

    jsc.stop()
  }
}

You can also use a Groovy script instead of a full-blown class, but you need to make the script serializable with a little trick, by specifying a custom base script class. You need a custom serializable Script class:

abstract class SerializableScript extends Script implements Serializable {}

And in your job script, you should specify that this is your base script class with:

import groovy.transform.BaseScript

@BaseScript SerializableScript baseScript

The project comes with a Gradle build file, so you can compile and build your project with the gradle jar command to quickly create a JAR archive.

Now let’s focus on the Cloud Dataproc part of the story! I basically followed the quickstart guide. I used the Console (the web UI), but you could just as well use the gcloud command-line tool. You’ll need an account of course, and billing enabled, as running Spark jobs on clusters can potentially be expensive, but don’t fear, there’s a free trial that you can take advantage of! You can also do some quick computation with the pricing calculator to estimate how much a certain workload will cost you. In my case, as a one-off job, it’s a sub-dollar bill that I have to pay.

Let’s create a brand new project:

We’re going to create a Spark cluster, but we’ll need to enable the Compute Engine API for this to work, so head over to the hamburger menu, select the API manager item, and enable it:

Select the Dataproc menu from the hamburger, which will allow you to create a brand new Spark cluster:

Create a cluster as follows (the smallest one possible for our demo):

Also, in case you have some heavy & expensive workloads, for which it doesn’t matter much if they can be interrupted or not (and then relaunched later on), you could also use Preemptible VMs to further lower the cost.

We created a JAR archive for our Groovy Spark demo, and for the purpose of this demo, we’ll push the JAR into Google Cloud Storage, to create Spark jobs with this JAR (but there are other ways to push your job’s code automatically as well). From the menu again, go to Cloud Storage, and create a new bucket:

Create a bucket with a name of your choice (we’ll need to remember it when creating the Spark jobs):

Once this bucket is created, click on it, and then click on the “upload files” button, to upload your JAR file:

We can come back to the Dataproc section, clicking on the Jobs sub-menu to create a new job:

We’ll create a new job, using our recently created cluster. We’ll need to specify the location of the JAR containing our Spark job: we’ll use the URL gs://groovy-spark-demo-jar/spark-groovy-1.1.jar. The gs:// part corresponds to the Google Cloud Storage protocol, as that’s where we’re hosting our JAR. Then groovy-spark-demo-jar/ corresponds to the name of the bucket we created, and then at the end, the name of the JAR file. We’ll use an argument of 1000 to specify the number of parallel computations of our Pi approximation algorithm we want to run:

Click “Submit”, and here we go, our Groovy Spark job is running in the cloud on our 2-node cluster!

Just a bit of setup through the console, which you can also do from the command-line, and of course a bit of Groovy code to do the computation. Be sure to have a look at the quick start guide, which gives more details than this blog post, and you can look at some other Groovy Spark samples thanks to Paolo on his Github project.
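
For reference, the gcloud equivalent of that job submission should look roughly like the following; the cluster name is a made-up placeholder here, and the exact flags may vary with your gcloud version, so double-check them against the Dataproc documentation:

$ gcloud dataproc jobs submit spark \
    --cluster my-spark-cluster \
    --class org.apache.spark.examples.GroovySparkPi \
    --jars gs://groovy-spark-demo-jar/spark-groovy-1.1.jar \
    -- 1000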