Tale of a Groovy Spark in the Cloud

As I recently joined Google’s developer advocacy team for Google Cloud Platform, I thought I could have a little bit of fun with combining my passion for Apache Groovy with some cool cloudy stuff from Google! Incidentally, Paolo Di Tommaso tweeted about his own experiments with using Groovy with Apache Spark, and shared his code on Github:

I thought that would be a nice fun first little project to try to use Groovy to run a Spark job on Google Cloud Dataproc! Dataproc manages Hadoop & Spark for you: it’s a service that provides managed Apache Hadoop, Apache Spark, Apache Pig and Apache Hive. You can easily process big datasets at low cost, control those costs by quickly creating managed clusters of any size and turning them off where you’re done. In addition, you can obviously use all the other Google Cloud Platform services and products from Dataproc (ie. store the big datasets in Google Cloud Storage, on HDFS, through BigQuery, etc.)

More concretely,, how do you run a Groovy job in Google Cloud Dataproc’s managed Spark service? Let’s see that in action!

To get started, I checked out Paolo’s samples from Github, and I even groovy-fied the Pi calculation example (based on this approach) to make it a bit more idiomatic:

package org.apache.spark.examples

import groovy.transform.CompileStatic
import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.api.java.function.Function
import org.apache.spark.api.java.function.Function2
import scala.Function0
final class GroovySparkPi {
static void main(String[] args) throws Exception {
  def sparkConf = new SparkConf().setAppName("GroovySparkPi")
  def jsc = new JavaSparkContext(sparkConf)
  int slices = (args.length == 1) ? Integer.parseInt(args[0]) : 2
  int n = 100000 * slices
  def dataSet = jsc.parallelize(0..
  def mapper = {
    double x = Math.random() * 2 - 1
    double y = Math.random() * 2 - 1
    return (x * x + y * y < 1) ? 1 : 0
  int count = dataSet
          .map(mapper as Function)
          .reduce({int a, int b -> a + b} as Function2)
  println "Pi is roughly ${4.0 * count / n}"

You can also use a Groovy script instead of a full-blown class, but you need to make the script serializable with a little trick, by specifying a custom base script class. You need a custom Serializable Script:

import groovy.transform.BaseScript

@BaseScript SerializableScript baseScript

And in your job script, you should specify this is your base script class with:

abstract class SerializableScript extends Script implements Serializable {}

The project comes with a Gradle build file, so you can compile and build your project with the gradle jar command to quickly create a JAR archive.

Now let’s focus on the Cloud Dataproc part of the story! I basically simply followed the quickstart guide. I used the Console (the UI web interface), but you could as well use the gcloud command-line tool as well. You’ll need an account of course, and enable billing, as running Spark jobs on clusters can be potentially expensive, but don’t fear, there’s a free trial that you can take advantage of! You can also do some quick computation with the calculator to estimate how much a certain workload will cost you. In my case, as a one time off job, this is a sub-dollar bill that I have to pay.

Let’s create a brand new project:

We’re going to create a Spark cluster, but we’ll need to enable the Compute Engine API for this to work, so head over to the hamburger menu, select the API manager item, and enable it:

Select the Dataproc menu from the hamburger, which will allow you to create a brand new Spark cluster:

Create a cluster as follows (the smallest one possible for our demo):

Also, in case you have some heavy & expensive workloads, for which it doesn’t matter much if they can be interrupted or not (and then relaunched later on), you could also use Preemptible VMs to further lower the cost.

We created a JAR archive for our Groovy Spark demo, and for the purpose of this demo, we’ll push the JAR into Google Cloud Storage, to create Spark jobs with this JAR (but there are other ways to push your job’s code automatically as well). From the menu again, go to Cloud Storage, and create a new bucket:

Create a bucket with a name of your choice (we’ll need to remember it when creating the Spark jobs):

Once this bucket is created, click on it, and then click on the “upload files” button, to upload your JAR file:

We can come back to the Dataproc section, clicking on the Jobs sub-menu to create a new job:

We’ll create a new job, using our recently created cluster. We’ll need to specify the location of the JAR containing our Spark job: we’ll use the URL gs://groovy-spark-demo-jar/spark-groovy-1.1.jar. The gs:// part corresponds to the Google Cloud Storage protocol, as that’s where we’re hosting our JAR. Then groovy-spark-demo-jar/ corresponds to the name of the bucket we created, and then at the end, the name of the JAR file. We’ll use an argument of 1000 to specify the number of parallel computations of our Pi approximation algorithm we want to run:

Click “Submit”, and here we go, our Groovy Spark job is running in the cloud on our 2-node cluster!

Just a bit of setup through the console, which you can also do from the command-line, and of course a bit of Groovy code to do the computation. Be sure to have a look at the quick start guide, which gives more details than this blog post, and you can look at some other Groovy Spark samples thanks to Paolo on his Github project.

Joining Google as Developer Advocate for the Google Cloud Platform

The cat is out the bag: I’m joining Google on June 6th, as a Developer Advocate for the Google Cloud Platform team!

My Groovy friends will likely remember when I launched Gaelyk, a lightweight toolkit for developing Groovy apps on Google App Engine? Since then, I’ve always been a big fan of the Google Cloud Platform (although it wasn’t called that way then) and followed the latest developments of the whole platform. And wohhh, so many new services and products have seen the light of day since my early experiments with App Engine! So there will be a lot to learn, a lot to do, and thus, a lot to advocate! 

I’m really happy and excited to join the Google Cloud Platform team! I’m looking forward to working with my new team, and join some good old Googler friends I’ve came to know throughout my career.

I’ve been really pleased to work with my friends at Restlet for the past year and a half, made friends, and learnt a lot along the way working with my colleagues. I wish them luck for their great Web API platform, and I’ll continue to follow their progress!

A Groovy journey in Open Source land (GR8Conf Europe)

Direct live from GR8Conf Europe 2016, in Copenhagen, Denmark! This morning, I presented my latest update about the Apache Groovy history, and the latest developments in the 2.4.x and future 2.5 branches. 


In dog years... err... Open Source years, the Groovy programming language project is a very mature and successful one, as its 12 million downloads a year can attest. The Groovy language is certainly the most widely deployed alternative language of the JVM today. But how do we go from a hobby night & week-end project to professionally company sponsored? And back again to hobby mode but joining the wider Apache Software Foundation community?

Guillaume will guide you through the history of the project, its latest developments, and its recent news, outlining the importance of a community around an Open Source project.

Also, we'll discuss what it means to contribute, when it's your hobby or as a paid committer -- what does it change? What it means to join the Apache community, what the impact of professional Open Source is, and more.

And here are the slides:

Get in the flow! The API developer workflow!

What are the activities of the Web API developer? How API tooling should not get in the way of developer's productivity? I presented a talk on this topic at the GlueCon conference:

The API ecosystem provides powerful tools, online services and definition formats for designing, testing, running, or managing APIs. All share common purposes: improve our productivity when developing an API, allow us to collaborate more effectively, or share our creations with the world!

But developers have already invented efficient tactics to streamline their development, gathered experience with and sharpened their tools of trade. The result is that the services or formats mentioned before can actually also get in their way, and interrupt their development flow, as they have to resort to get out of their routine and processes, to use them.

What can API tooling vendors do to reconcile the habits of developers with their tools? 
In this session, Guillaume Laforge, Restlet's Product Ninja & Advocate, will talk about building, versioning & dependency management of API artifacts, scenario & conformance testing, API documentation, continuous integration, multi-environment continuous deployment, and team collaboration! Let’s get back into the development flow!

A five-sided prism polarizing Web API development

At GlueCon, I presented about the 5-sided prism that polarizes Web API development:

How do you tackle your API development? Are you diving head-first in the code to get something quickly out the door? Do you start by defining the API contract, that you'll share between your teams and the consumers? Perhaps you prefer to describe your acceptance tests, explaining the behavior you expect from your API. But if you're a storyteller, you'll probably write some use cases, scenarios, to have a better feel for what your API is all about, and how your users will take advantage of it. Or simply, you already have data lying around that wants to set free, and be exposed restfully to the world.

In this session, Guillaume Laforge, Restlet's Product Ninja & Advocate, will highlight different approaches to Web API development, along with their pros & cons. Whether you're starting with code, a contract, tests, documentation, or data, you'll get a glimpse of light into the tasty book of API development recipes.

© 2012 Guillaume Laforge | The views and opinions expressed here are mine and don't reflect the ones from my employer.