My favorite Cloud Next sessions

The schedule for Google Cloud Next was unveiled this week, and there are lots of interesting sessions to attend. With the many parallel tracks, it's difficult to choose, but I wanted to highlight some of the talks I'd like to watch!

The Google Cloud Platform is a pretty rich one, with many options for your compute needs. How do you choose which one is best for your use case? Brian Dorsey covers this in detail in this session:

To explore some of the compute options a bit further, I'd recommend looking at Container Engine with ABCs of Google Container Engine: tips and best practices by Piotr Szczesniak, and Go beyond PaaS with App Engine Flexible Environment by Justin Beckwith.

The Serverless trend is strong these days, and in this area, I spotted two slots here with Firebase, Cloud Functions: Live coding a serverless app with Firebase and Google Cloud Platform by Mike McDonald, Jen Tong, Frank van Puffelen, and Serverless computing options with Google Cloud Platform by Bret McGowen.

I've blogged before about Cloud Endpoints, as I'm interested in the world of Web APIs, and there are two talks I'd like to attend in this area: Google Cloud Endpoints: serving your API to the world by Francesc Campoy Flores and Authorizing service-to-service calls with Google Cloud Endpoints by Dan Ciruli, Sep Ebrahimzadeh.

And in my misc. category, I'd like to highlight this one on the APIs for G Suite: Developing new apps built for your organization with Google Docs, Slides, Sheets and Sites APIs by Ritcha Ranjan. There's also a talk on big parallel data processing: Using Apache Beam for parallel data processing by Frances Perry.

And to finish, I have to mention my own talk, which I'll be presenting with Brad Abrams: Talking to your users: Build conversational actions for Google Assistant. It should be fun!

What talks are you going to attend?





Deploy a Ratpack app on Google App Engine Flex

The purpose of this article is to deploy a Ratpack web application on Google App Engine Flex.

For my demos at conferences, I often use frameworks like Ratpack, Grails or Gaelyk, which are based on the Apache Groovy programming language. In a previous article, I already used Ratpack for a slightly more complex use case, but this time, I want to share a quick Ratpack hello world and deploy it on Flex.

I started with a hello world template generated by Lazybones (a simple project creation tool that uses packaged project templates), which I had installed with SDKman (a tool for managing parallel versions of multiple Software Development Kits). But you can obviously go ahead with your own Ratpack app. Feel free to skip the next section if you already have an app.

Create a Ratpack project
# install SDKman
curl -s "https://get.sdkman.io" | bash
# install lazybones with sdkman
sdk install lazybones
# create your hello world Ratpack app from a template
lazybones create ratpack flex-test-1
You can then quickly run your app with:
cd flex-test-1
./gradlew run
And point your browser to http://localhost:5050 to see your app running.

We'll use the distTar task to create a distribution of our app, so build it with:
./gradlew distTar

Get ready for Flex

To run our app on App Engine Flex, we'll need to do two things: 1) containerize it with Docker, and 2) create an app.yaml app descriptor. Let's start with Docker. Create a Dockerfile, and adapt the path names appropriately (replace "flex-test-1" with the name of the directory you created your project in):
FROM gcr.io/google_appengine/openjdk8
VOLUME /tmp
ADD build/distributions/flex-test-1.tar /
ENV JAVA_OPTS='-Dratpack.port=8080 -Djava.security.egd=file:/dev/./urandom'
ENTRYPOINT ["/flex-test-1/bin/flex-test-1"]
I'm using OpenJDK 8 for my custom runtime. I add my tarred project, specify port 8080 for running (as required by Flex), and define the entry point to my generated startup script.

My app.yaml file, for App Engine Flex, is pretty short, and expresses that I'm using the Flexible environment:
runtime: custom
env: flex
threadsafe: true
Create and deploy your project on Google Cloud Platform

Create an App Engine project on the Google Cloud Platform console, and note the project name. You should also install the gcloud SDK to be able to deploy your Ratpack app from the command line. Once done, you'll be able to go through the deployment with:
gcloud app deploy
After a little while, your Ratpack app should be up and running!

A poor-man assistant with speech recognition and natural language processing

All sorts of voice-powered assistants are available today, and chat bots are the new black! In order to illustrate how such tools are made, I decided to create my own little basic conference assistant, using Google's Cloud Speech API and Cloud Natural Language API. This is a demo I actually created for the Devoxx 2016 keynote, when Stephan Janssen invited me on stage to speak about Machine Learning. And to make this demo more fun, I implemented it with a shell script, some curl calls, plus some other handy command-line tools.

So what is this "conference assistant" all about? Thanks for asking. The idea is to ask questions to this assistant about topics you'd like to see during the conference. For example: "Is there a talk about the Google Cloud Vision API?". You send that voice request to the Speech API, which gives you back the transcript of the question. You can then use the Natural Language API to process that text to extract the relevant topic in that question. Then you query the conference schedule to see if there's a talk matching the topic.

Let's see this demo in action, before diving into the details:

So how did I create this little command-line conference assistant? Let's start with a quick diagram showing the whole process and its steps:


  • First, I record the audio using the sox command-line tool.
  • The audio file is saved locally, and I upload it to Google Cloud Storage (GCS).
  • I then call the Speech API, pointing it at my recorded audio file in GCS, so that it returns the text it recognized from the audio.
  • I use the jq command-line tool to extract the words from the returned JSON payload, keeping only the words I'm interested in (basically what appears after the "about" part of my query, i.e. "a talk *about* machine learning")
  • Lastly, I'm calling a custom search engine that points at the conference website schedule, to find the relevant talks that match my search query.
Let's have a look at the script in more detail (this is the simplified script without all the shiny terminal colors and logging output). You should create a project in the Google Cloud Console, and note its project ID, as we'll reuse it for storing our audio file.

#!/bin/bash

# create an API key to access the Speech and NL APIs
# https://support.google.com/cloud/answer/6158862?hl=en
export API_KEY=YOUR API KEY HERE

# create a Google Custom Search and retrieve its id
export CS_ID=THE ID OF YOUR GOOGLE CUSTOM SEARCH

# to use sox for recording audio, you can install it with:
# brew install sox --with-lame --with-flac --with-libvorbis
sox -d -r 16k -c 1 query.flac
# once the recording is over, hit CTRL-C to stop

# upload the audio file to Google Cloud Storage with the gsutil command
# see the documentation for installing it, as well as the gcloud CLI
# https://cloud.google.com/storage/docs/gsutil_install
# https://cloud.google.com/sdk/docs/
gsutil copy -a public-read query.flac gs://devoxx-ml-demo.appspot.com/query.flac

# call the Speech API with the template request saved in speech-request.json:
# {
#   "config": {
#     "encoding":"FLAC",
#     "sample_rate": 16000,
#     "language_code": "en-US"
#   },
#   "audio": {
#     "uri":"gs://YOUR-PROJECT-ID-HERE.appspot.com/query.flac"
#   }
# }
curl -s -X POST -H "Content-Type: application/json" --data-binary @speech-request.json "https://speech.googleapis.com/v1beta1/speech:syncrecognize?key=${API_KEY}" > speech-output.json

# retrieve the text recognized by the Speech API
# using the jq to just extract the text part
cat speech-output.json | jq -r .results[0].alternatives[0].transcript > text.txt

# prepare a query for the Natural Language API
# replacing the @TEXT@ place holder with the text we got from Speech API;
# the JSON query template looks like this:
# {
#   "document": {
#     "type": "PLAIN_TEXT",
#     "content": "@TEXT@"
#   },
#   "features": {
#     "extractSyntax": true,
#     "extractEntities": false,
#     "extractDocumentSentiment": false
#   }
# }
sed "s/@TEXT@/`cat text.txt`/g" nl-request-template.json > nl-request.json

# call the Natural Language API with our template
curl -s -X POST -H "Content-Type: application/json" --data-binary @nl-request.json https://language.googleapis.com/v1beta1/documents:annotateText?key=${API_KEY} > nl-output.json

# retrieve all the analyzed words from the NL call results
cat nl-output.json | jq -r .tokens[].lemma > lemmas.txt

# only keep the words after the "about" word which refer to the topic searched for
sed -n '/about/,$p' lemmas.txt | tail -n +2 > keywords.txt

# join the words together to pass them to the search engine
cat keywords.txt | tr '\n' '+' > encoded-keywords.txt

# call the Google Custom Search engine, with the topic search query
# and use jq again to filter only the title of the first search result
# (the page covering the talk usually comes first)
curl -s "https://www.googleapis.com/customsearch/v1?key=$API_KEY&cx=$CS_ID&q=`cat encoded-keywords.txt`" | jq .items[0].title
And voilà, we have our conference assistant on the command line! We used the Speech API to recognize the voice and extract the text corresponding to the query audio, we analyzed this text with the Natural Language API, and we used a few handy command-line tools as the glue.
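To get a feel for the keyword step without recording anything, here is a minimal sketch that runs the same sed and tail filters on a hand-written list of lemmas. The function name extract_topic_query and the sample words are assumptions made up for illustration, and paste is swapped in for tr to join the words.

```shell
#!/bin/bash

# Hypothetical helper (not part of the original script): reads one lemma per
# line on stdin, keeps the lines after the first "about", and joins them with "+".
extract_topic_query() {
  sed -n '/about/,$p' | tail -n +2 | paste -sd '+' -
}

# Made-up lemmas for the question "Is there a talk about machine learning".
printf '%s\n' be there a talk about machine learning | extract_topic_query
# prints: machine+learning
```

Unlike tr '\n' '+', paste does not leave a trailing "+" at the end of the query string, which is the only reason it is used in this sketch.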

Devoxx: Machine Intelligence at Google Scale: Vision/Speech API, TensorFlow and Cloud Machine Learning

With my colleague Martin Görner, at the Devoxx conference in Belgium last month, we gave a talk on Machine Learning, covering the various APIs provided by Google Cloud, the TensorFlow Machine Learning Open Source project, and the Cloud ML service. I didn't get a chance to publish the slides, so it's time I fix that!

Machine Intelligence at Google Scale: Vision/Speech API, TensorFlow and Cloud Machine Learning

The biggest challenge of Deep Learning technology is scalability. As long as you are using a single GPU server, you have to wait for hours or days to get the result of your work. This doesn't scale for a production service, so you eventually need distributed training on the cloud. Google has been building infrastructure for training large-scale neural networks on the cloud for years, and has now started to share the technology with external developers. In this session, we will introduce new pre-trained ML services such as the Cloud Vision API and Speech API, which work without any training. Also, we will look at how TensorFlow and Cloud Machine Learning will accelerate custom model training by 10x to 40x with Google's distributed training infrastructure.

The video is available on YouTube (in particular, don't miss the cool demos!):


And you can look at the slides here:

Analyzing half a million Gradle build files

Gradle is becoming the build automation solution of choice among developers, in particular in the Java ecosystem. With the Github archive published as a Google BigQuery dataset, it's possible to analyze those build files, and see if we can learn something interesting about them!

This week, I was at the G3 Summit conference, and presented on this topic: I covered the Apache Groovy language, as per my previous article, but I expanded my queries to also look at Grails applications, and Gradle build files. So let's see what the dataset tells us about Gradle!

Number of Gradle build files and repositories

Instead of going through the whole Github dataset, I'm going to restrict the dataset by saving only the Gradle build files in my own, smaller, dataset:

SELECT * FROM [bigquery-public-data:github_repos.files] 
WHERE RIGHT(path, 7) = '.gradle'

This query returns only the files whose extension is .gradle. I'm saving the results in my [github.gradle_build_files] table.

But I also need the content of those files:

SELECT *
FROM [bigquery-public-data:github_repos.contents]
WHERE id IN (SELECT id FROM [github.gradle_build_files])

And I will save the content in the table [github.gradle_build_contents].

Let's start with a simple query to count the Gradle build files on Github:

SELECT COUNT(*) as count
FROM [github-groovy-files:github.gradle_build_files]

There are 488,311 Gradle build files! Roughly half a million.

This is the number of Gradle files: note that a project can contain several build files, that a repository can contain several projects, but also that the Github dataset only provides data on repositories for which it could detect an Open Source license. So it gives an idea of the reach of Gradle, but doesn't necessarily give you the exact number of Gradle-based projects in the wild! (and obviously can't even account for the projects hosted internally and elsewhere)

Since a repository can contain several build files, let's have a look at the number of repositories containing Gradle build files:

SELECT COUNT(repo_name) as repos
FROM (
  SELECT repo_name
  FROM [github-groovy-files:github.gradle_build_files]
  GROUP BY repo_name
)

There are 102,803 repositories with Gradle build files.

I was curious to see the distribution of the number of build files across projects. So I used the quantiles function:

SELECT QUANTILES(buildFilesCount, 101) 
FROM (
  SELECT repo_name, COUNT(repo_name) as buildFilesCount
  FROM [github-groovy-files:github.gradle_build_files]
  GROUP BY repo_name
  ORDER BY buildFilesCount DESC
)

I used a small increment (one percent), as the data was skewed towards some repositories with a huge number of Gradle build files: essentially repositories like the Udemy course on Gradle for Android, or an online book about Android development, as they had tons of small build files or variations of build files with incremental changes for explanation purposes.

22% of the repositories had only 1 build file
85% of the repositories had up to 5 build files
95% of the repositories had fewer than 10 build files

The repository with the largest number of build files had 1333 of them!
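The percentages above read the quantiles in the other direction: for a given threshold, what share of repositories has at most that many build files? As a rough sketch of that reading (this is not BigQuery's QUANTILES function, and the counts below are made up rather than taken from the dataset), a hypothetical share_at_most helper could look like this:

```shell
#!/bin/bash

# Hypothetical helper: reads one build-file count per line on stdin and prints
# the percentage of repositories whose count is at most the given threshold.
share_at_most() {
  awk -v max="$1" '$1 <= max { hit++ } END { if (NR) printf "%d\n", hit * 100 / NR }'
}

# Made-up build-file counts for ten repositories.
sample_counts() { printf '%s\n' 1 1 2 3 3 4 5 7 12 40; }

sample_counts | share_at_most 1   # prints: 20
sample_counts | share_at_most 5   # prints: 70
```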

Gradle vs Maven

You might also be interested in comparing Gradle and Maven, as they are often pitted against each other in holy build wars. If you look at the number of pom.xml files on Github:

SELECT count(*) 
FROM [bigquery-public-data:github_repos.files]
WHERE path LIKE '%pom.xml'

There are 1,007,705 pom.xml files vs the 488,311 we counted for Gradle. So roughly twice as many for Maven.

But if you look at the number of repositories with Maven build files:

SELECT COUNT(repo_name) as repos
FROM (
  SELECT repo_name
  FROM [bigquery-public-data:github_repos.files]
  WHERE path LIKE '%pom.xml'
  GROUP BY repo_name
)

There are 131,037 repositories with Maven pom.xml files, compared to the 102,803 repositories with Gradle build files we counted earlier (only about 27% more). It seems Gradle is catching up with Maven!
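For readers who want to check the arithmetic behind "roughly twice as many" files and "about 27% more" repositories, here is a small sketch using awk. The helper names ratio and pct_more are invented for this example; the numbers are the counts quoted above.

```shell
#!/bin/bash

# Hypothetical helpers for the comparison; both take two counts as arguments.
ratio()    { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'; }
pct_more() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) * 100 / b }'; }

ratio 1007705 488311      # pom.xml files per Gradle build file, prints: 2.06
pct_more 131037 102803    # extra Maven repositories in percent, prints: 27.5
```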

Gradle build file names

Bigger projects tend to split their build tasks across different build files. I was curious to see how developers split them, by looking at the most frequent build file names:

SELECT f, COUNT(f) as count
FROM (
  SELECT LAST(SPLIT(path, '/')) AS f
  FROM [github-groovy-files:github.gradle_build_files]
)
GROUP BY f
ORDER BY count DESC


Of course, build.gradle comes first, followed by settings.gradle. Notice the number of build files which are related to making releases and publishing or deploying the artifacts to a repository. There are also a few checking the quality of the code base, using checkstyle for style violations and JaCoCo for code coverage.
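The same "last path segment, then count" idea can be tried locally on a handful of paths. The following is only a sketch with made-up paths and a hypothetical count_file_names function. It mirrors what LAST(SPLIT(path, '/')) and GROUP BY do in the query, but it is not how BigQuery computes the result.

```shell
#!/bin/bash

# Hypothetical helper: reads one file path per line on stdin, strips everything
# up to the last "/", and prints "name count" pairs, most frequent first.
count_file_names() {
  sed 's|.*/||' | sort | uniq -c | sort -rn | awk '{ print $2, $1 }'
}

# Made-up paths, only for illustration.
printf '%s\n' build.gradle app/build.gradle lib/build.gradle \
  settings.gradle gradle/publish.gradle \
  | count_file_names | head -n 1
# prints: build.gradle 3
```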

Gradle versions

Gradle projects often use the Gradle wrapper to help developers use a particular and consistent version of Gradle, without requiring Gradle to be installed locally. For those developers who decided to commit their Gradle wrapper to Github, we can have a look at the breakdown of Gradle versions currently in the wild:

SELECT version, COUNT(version) AS count
FROM (
  SELECT REGEXP_EXTRACT(line, r'gradle-(.*)-(?:all|bin).zip') AS version
  FROM (
    SELECT SPLIT(content, '\n') AS line
    FROM [github-groovy-files:github.gradle_wrapper_properties_files]
  )
  WHERE line LIKE 'distributionUrl%'
)
GROUP BY version
ORDER BY count DESC


It looks like Gradle 2.4 was a big hit!
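To see what the regular expression in the query picks out of a wrapper properties file, here is a hedged local equivalent written with sed. The function name extract_gradle_version is an assumption, and the sample lines are typical wrapper entries written by hand, not rows from the dataset.

```shell
#!/bin/bash

# Hypothetical helper: reads gradle-wrapper.properties lines on stdin and prints
# the version found in the distributionUrl line, if there is one.
extract_gradle_version() {
  sed -nE 's/^distributionUrl.*gradle-(.*)-(all|bin)\.zip.*/\1/p'
}

# Hand-written sample lines.
printf '%s\n' \
  'distributionBase=GRADLE_USER_HOME' \
  'distributionUrl=https\://services.gradle.org/distributions/gradle-2.4-all.zip' \
  | extract_gradle_version
# prints: 2.4
```

Piping the output of such a helper through sort | uniq -c | sort -rn would give a per-version count similar in spirit to the GROUP BY in the query.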

Gradle plugins

Gradle projects often take advantage of third-party plugins. You'll see plugins declared with the "id" syntax or applied with "apply plugin". Let's look at both:

SELECT plugin, COUNT(plugin) AS count
FROM (
  SELECT REGEXP_EXTRACT(line, r'apply plugin: (?:\'|\")(.*)(?:\'|\")') AS plugin
  FROM (
    SELECT SPLIT(content, '\n') AS line
    FROM [github-groovy-files:github.gradle_build_contents]
  )
)
GROUP BY plugin
ORDER BY count DESC


Look at the large number of Android-related plugins! Clearly, Android adopting Gradle as its build solution gave a big boost to Gradle's adoption!

The plugins declared with "id" tell another story though:

SELECT newplugin, COUNT(newplugin) AS count
FROM (
  SELECT REGEXP_EXTRACT(line, r'id (?:\'|\")(.*)(?:\'|\") version') AS newplugin
  FROM (
    SELECT SPLIT(content, '\n') AS line
    FROM [github-groovy-files:github.gradle_build_contents]
  )
)
GROUP BY newplugin
ORDER BY count DESC


Here, we see heavy usage of the Bintray plugin and the shadow plugin.

Build dependencies

Now it's time to look at dependencies. First, the "compile" dependencies:

SELECT dep, COUNT(dep) AS count
FROM (
  SELECT REGEXP_EXTRACT(line, r'compile(?: |\()(?:\'|\")(.*):') AS dep
  FROM (
    SELECT SPLIT(content, '\n') AS line
    FROM [github-groovy-files:github.gradle_build_contents]
  )
)
GROUP BY dep
ORDER BY count DESC


Again, there's heavy usage of Android-related dependencies. We also notice Spring Boot, GSON, Guava, SLF4J, Retrofit, and Jackson.
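One detail of the dependency regex is worth seeing on a concrete line: because the capture is greedy and stops at the last colon, it keeps the group and artifact but drops the version. Here is a minimal sed sketch of the same pattern, assuming hand-written dependency lines and a hypothetical extract_compile_dep helper:

```shell
#!/bin/bash

# Hypothetical helper: reads build file lines on stdin and prints the
# "group:artifact" part of each compile dependency (the version is dropped
# because the greedy capture runs up to the last colon).
extract_compile_dep() {
  sed -nE "s/.*compile[ (]['\"](.*):.*/\1/p"
}

# Hand-written sample lines; the testCompile one is skipped (capital "C").
printf '%s\n' \
  "compile 'com.google.guava:guava:19.0'" \
  'compile("com.squareup.retrofit2:retrofit:2.1.0")' \
  "testCompile 'junit:junit:4.12'" \
  | extract_compile_dep
# prints:
# com.google.guava:guava
# com.squareup.retrofit2:retrofit
```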

For the test dependencies:

SELECT dep, COUNT(dep) AS count
FROM (
  SELECT REGEXP_EXTRACT(line, r'testCompile(?: |\()(?:\'|\")(.*):') AS dep
  FROM (
    SELECT SPLIT(content, '\n') AS line
    FROM [github-groovy-files:github.gradle_build_contents]
  )
)
GROUP BY dep
ORDER BY count DESC


No big surprise with JUnit coming first. But we also see Spock, Mockito's mocking library, AssertJ assertions, and Hamcrest matchers.

Summary

And this wraps up our analysis of Gradle build files, thanks to Google BigQuery and the Github dataset. It's interesting to see that Gradle has gained a very significant market share, coming pretty close to the Maven incumbent, and to see lots of Android projects are on Github with Gradle builds.
 
© 2012 Guillaume Laforge | The views and opinions expressed here are mine and don't reflect the ones from my employer.