Devoxx: Machine Intelligence at Google Scale: Vision/Speech API, TensorFlow and Cloud Machine Learning

At the Devoxx conference in Belgium last month, my colleague Martin Görner and I gave a talk on Machine Learning, covering the various APIs provided by Google Cloud, the TensorFlow Machine Learning Open Source project, and the Cloud ML service. I didn't get a chance to publish the slides, so it's time I fix that!

Machine Intelligence at Google Scale: Vision/Speech API, TensorFlow and Cloud Machine Learning

The biggest challenge of Deep Learning technology is scalability. As long as you are using a single GPU server, you have to wait for hours or days to get the result of your work. This doesn't scale for a production service, so you eventually need distributed training on the cloud. Google has been building infrastructure for training large-scale neural networks on the cloud for years, and has now started to share the technology with external developers. In this session, we will introduce new pre-trained ML services such as the Cloud Vision API and Speech API that work without any training. We will also look at how TensorFlow and Cloud Machine Learning will accelerate custom model training by 10x to 40x with Google's distributed training infrastructure.

The video is available on YouTube (in particular, don't miss the cool demos!):


And you can look at the slides here:

Analyzing half a million Gradle build files

Gradle is becoming the build automation solution of choice among developers, in particular in the Java ecosystem. With the Github archive published as a Google BigQuery dataset, it's possible to analyze those build files, and see if we can learn something interesting about them!

This week, I was at the G3 Summit conference, and presented on this topic: I covered the Apache Groovy language, as per my previous article, but I expanded my queries to also look at Grails applications and Gradle build files. So let's see what the dataset tells us about Gradle!

Number of Gradle build files and repositories

Instead of going through the whole Github dataset, I'm going to restrict the dataset by saving only the Gradle build files in my own, smaller, dataset:

SELECT * FROM [bigquery-public-data:github_repos.files] 
WHERE RIGHT(path, 7) = '.gradle'

This query returns only the files whose extension is .gradle. I'm saving the results in my [github.gradle_build_files] table.

But I also need the content of those files:

SELECT *
FROM [bigquery-public-data:github_repos.contents]
WHERE id IN (SELECT id FROM [github.gradle_build_files])

And I will save the content in the table [github.gradle_build_contents].

Let's start with a simple query to count the Gradle build files on Github:

SELECT COUNT(*) as count
FROM [github-groovy-files:github.gradle_build_files]

There are 488,311 Gradle build files! Roughly half a million.

This is the number of Gradle files: note that a project can contain several build files, that a repository can contain several projects, but also that the Github dataset only provides data on repositories for which it could detect an Open Source license. So it gives an idea of the reach of Gradle, but doesn't necessarily give you the exact number of Gradle-based projects in the wild! (and obviously can't even account for the projects hosted internally and elsewhere)

Since a repository can contain several build files, let's have a look at the number of repositories containing Gradle build files:

SELECT COUNT(repo_name) as repos
FROM (
  SELECT repo_name
  FROM [github-groovy-files:github.gradle_build_files]
  GROUP BY repo_name
)

There are 102,803 repositories with Gradle build files.

I was curious to see the distribution of the number of build files across projects. So I used the quantiles function:

SELECT QUANTILES(buildFilesCount, 101) 
FROM (
  SELECT repo_name, COUNT(repo_name) as buildFilesCount
  FROM [github-groovy-files:github.gradle_build_files]
  GROUP BY repo_name
  ORDER BY buildFilesCount DESC
)

I used a small increment (one percent), as the data was skewed by a few repositories with a huge number of Gradle build files: essentially repositories like the Udemy course on Gradle for Android, or an online book about Android development, which contain tons of small build files, or variations of build files with incremental changes for explanatory purposes.

22% of the repositories had only 1 build file
85% of the repositories had up to 5 build files
95% of the repositories had fewer than 10 build files

The repository with the largest number of build files had 1333 of them!
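To make the reading of those percentages concrete, here is a minimal Python sketch of the same idea on a tiny, made-up sample. The repository names and paths are hypothetical and are not taken from the dataset. It counts build files per repository, then reports the share of repositories at or below a given threshold.

```python
from collections import Counter

def build_files_per_repo(rows):
    """Count build files per repository from (repo_name, path) pairs."""
    return Counter(repo for repo, _path in rows)

def share_at_most(counts, threshold):
    """Fraction of repositories with at most `threshold` build files."""
    values = list(counts.values())
    return sum(1 for v in values if v <= threshold) / len(values)

# Hypothetical sample rows, for illustration only.
rows = [
    ("alice/app", "build.gradle"),
    ("alice/app", "settings.gradle"),
    ("bob/lib", "build.gradle"),
    ("carol/course", "lesson1/build.gradle"),
    ("carol/course", "lesson2/build.gradle"),
    ("carol/course", "lesson3/build.gradle"),
    ("dave/tool", "build.gradle"),
]

counts = build_files_per_repo(rows)
print(share_at_most(counts, 1))  # share of repos with a single build file
print(share_at_most(counts, 2))  # share of repos with up to 2 build files
```

On the real data, a single repository with over a thousand build files barely moves these shares, which is why percentiles are a more readable summary than an average here.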

Gradle vs Maven

You might also be interested in comparing Gradle and Maven, as they are often pitted against each other in holy build wars. If you look at the number of pom.xml files on Github:

SELECT count(*) 
FROM [bigquery-public-data:github_repos.files]
WHERE path LIKE '%pom.xml'

There are 1,007,705 pom.xml files vs the 488,311 we counted for Gradle. So roughly twice as many for Maven.

But if you look at the number of repositories with Maven build files:

SELECT COUNT(repo_name) as repos
FROM (
  SELECT repo_name
  FROM [bigquery-public-data:github_repos.files]
  WHERE path LIKE '%pom.xml'
  GROUP BY repo_name
)

There are 131,037 repositories with Maven pom.xml files, compared to the 102,803 repositories with Gradle build files we counted earlier (only about 27% more). It seems Gradle is catching up with Maven!
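The ratios quoted above can be double-checked with a few lines of Python. This sketch only uses the four counts reported by the queries. The per-repository averages at the end are a derived figure that is not part of the original analysis, so treat them as a rough indication only.

```python
# Counts reported by the queries in this article.
gradle_files, gradle_repos = 488_311, 102_803
maven_files, maven_repos = 1_007_705, 131_037

# How many times more Maven build files than Gradle build files.
file_ratio = maven_files / gradle_files

# Relative excess of Maven repositories over Gradle repositories.
repo_excess = (maven_repos - gradle_repos) / gradle_repos

# Derived figure (an assumption about what is worth comparing):
# average number of build files per repository for each tool.
files_per_gradle_repo = gradle_files / gradle_repos
files_per_maven_repo = maven_files / maven_repos

print(f"Maven has {file_ratio:.2f}x as many build files as Gradle")
print(f"Maven has {repo_excess:.0%} more repositories than Gradle")
print(f"Build files per repository: Gradle {files_per_gradle_repo:.1f}, "
      f"Maven {files_per_maven_repo:.1f}")
```

The gap between the two ratios suggests that Maven repositories in this dataset hold more build files each on average, which would be consistent with multi-module builds carrying one pom.xml per module.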

Gradle build file names

Bigger projects tend to split their build logic across several build files. I was curious to see what kind of split developers made, so I looked at the most frequent build file names:

SELECT f, COUNT(f) as count
FROM (
  SELECT LAST(SPLIT(path, '/')) AS f
  FROM [github-groovy-files:github.gradle_build_files]
)
GROUP BY f
ORDER BY count DESC


Of course, build.gradle comes first, followed by settings.gradle. Notice the number of build files related to making releases and publishing or deploying artifacts to a repository. There are also a few dedicated to checking the quality of the code base, using Checkstyle for style violations or JaCoCo for code coverage.
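The LAST(SPLIT(path, '/')) step in the query above keeps only the last path segment. As a small illustration, the same grouping can be sketched in Python. The paths below are hypothetical examples, not rows from the dataset.

```python
from collections import Counter

def file_name(path):
    """Return the last segment of a slash-separated path."""
    return path.rsplit("/", 1)[-1]

# Hypothetical paths, for illustration only.
paths = [
    "build.gradle",
    "settings.gradle",
    "app/build.gradle",
    "library/build.gradle",
    "gradle/publish.gradle",
]

# Group by file name and sort by descending frequency.
name_counts = Counter(file_name(p) for p in paths)
for name, count in name_counts.most_common():
    print(name, count)
```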

Gradle versions

Gradle projects often use the Gradle wrapper to help developers use a particular and consistent version of Gradle, without requiring Gradle to be installed locally. For those developers who decided to commit their Gradle wrapper to Github, we can have a look at the breakdown of Gradle versions currently in the wild:

SELECT version, COUNT(version) AS count
FROM (
  SELECT REGEXP_EXTRACT(line, r'gradle-(.*)-(?:all|bin).zip') AS version
  FROM (
    SELECT SPLIT(content, '\n') AS line
    FROM [github-groovy-files:github.gradle_wrapper_properties_files]
  )
  WHERE line LIKE 'distributionUrl%'
)
GROUP BY version
ORDER BY count DESC


It looks like Gradle 2.4 was a big hit!
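To see what the regular expression in that query captures, here is a minimal Python sketch applied to typical gradle-wrapper.properties lines. The sample lines are illustrative. The pattern is the one from the query, with one small tightening that is my own assumption: the dot before "zip" is escaped so that it only matches a literal dot.

```python
import re

# Same capture as the query: everything between "gradle-" and "-all"/"-bin".
VERSION_PATTERN = re.compile(r"gradle-(.*)-(?:all|bin)\.zip")

def gradle_version(line):
    """Extract the Gradle version from a distributionUrl line, else None."""
    # Mirrors the WHERE line LIKE 'distributionUrl%' filter.
    if not line.startswith("distributionUrl"):
        return None
    match = VERSION_PATTERN.search(line)
    return match.group(1) if match else None

# Illustrative lines in the style of a gradle-wrapper.properties file.
lines = [
    r"distributionUrl=https\://services.gradle.org/distributions/gradle-2.4-all.zip",
    r"distributionUrl=https\://services.gradle.org/distributions/gradle-2.14.1-bin.zip",
    "zipStorePath=wrapper/dists",
]

for line in lines:
    print(gradle_version(line))
```

Because the capture group is greedy, pre-release identifiers such as release candidates end up in the version string as well, so they show up as distinct versions in the breakdown.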

Gradle plugins

Gradle projects often take advantage of third-party plugins. You'll see plugins declared with the "id" syntax or applied with "apply plugin". Let's look at both:

SELECT plugin, COUNT(plugin) AS count
FROM (
  SELECT REGEXP_EXTRACT(line, r'apply plugin: (?:\'|\")(.*)(?:\'|\")') AS plugin
  FROM (
    SELECT SPLIT(content, '\n') AS line
    FROM [github-groovy-files:github.gradle_build_contents]
  )
)
GROUP BY plugin
ORDER BY count DESC


Look at the large number of Android-related plugins! Clearly, Android adopting Gradle as its build solution gave a big boost to Gradle's adoption!

The plugins declared with "id" tell another story though:

SELECT newplugin, COUNT(newplugin) AS count
FROM (
  SELECT REGEXP_EXTRACT(line, r'id (?:\'|\")(.*)(?:\'|\") version') AS newplugin
  FROM (
    SELECT SPLIT(content, '\n') AS line
    FROM [github-groovy-files:github.gradle_build_contents]
  )
)
GROUP BY newplugin
ORDER BY count DESC


Here, we see heavy use of the Bintray plugin and the shadow plugin.
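The two queries differ only in their regular expression, and it helps to see which lines each one picks up. Below is a minimal Python sketch using equivalent patterns, written with a character class for the quotes, on illustrative build file lines. Note that the "id" pattern requires a trailing "version", so core plugins declared as a bare id 'java' are not counted by that query.

```python
import re

# Equivalent to the patterns in the two queries above.
APPLY_PLUGIN = re.compile(r"""apply plugin: ['"](.*)['"]""")
PLUGIN_ID = re.compile(r"""id ['"](.*)['"] version""")

def applied_plugin(line):
    """Plugin name from an `apply plugin: '...'` line, else None."""
    match = APPLY_PLUGIN.search(line)
    return match.group(1) if match else None

def declared_plugin(line):
    """Plugin id from an `id '...' version '...'` line, else None."""
    match = PLUGIN_ID.search(line)
    return match.group(1) if match else None

# Illustrative lines, not taken from the dataset.
print(applied_plugin("apply plugin: 'com.android.application'"))
print(declared_plugin('    id "com.jfrog.bintray" version "1.7"'))
print(declared_plugin("    id 'java'"))  # no version, so not matched
```

This difference in what each pattern can match may partly explain why the two rankings look so different.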

Build dependencies

Now it's time to look at dependencies. First, the "compile" dependencies:

SELECT dep, COUNT(dep) AS count
FROM (
  SELECT REGEXP_EXTRACT(line, r'compile(?: |\()(?:\'|\")(.*):') AS dep
  FROM (
    SELECT SPLIT(content, '\n') AS line
    FROM [github-groovy-files:github.gradle_build_contents]
  )
)
GROUP BY dep
ORDER BY count DESC


Again, Android-related dependencies are heavily used. We also notice Spring Boot, GSON, Guava, SLF4J, Retrofit, and Jackson.

For the test dependencies:

SELECT dep, COUNT(dep) AS count
FROM (
  SELECT REGEXP_EXTRACT(line, r'testCompile(?: |\()(?:\'|\")(.*):') AS dep
  FROM (
    SELECT SPLIT(content, '\n') AS line
    FROM [github-groovy-files:github.gradle_build_contents]
  )
)
GROUP BY dep
ORDER BY count DESC


No big surprise with JUnit coming first. But we also see Spock, the Mockito mocking library, AssertJ assertions, and Hamcrest matchers.
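Both dependency queries rely on a greedy capture followed by a colon, which means they keep the "group:artifact" part and drop the version. Here is a minimal Python sketch of that behavior, assuming dependencies written in the usual 'group:artifact:version' string notation. The helper name and sample lines are hypothetical.

```python
import re

def dependency(line, configuration="compile"):
    """Return 'group:artifact' for a dependency line, else None.

    The greedy (.*) runs up to the last colon on the line, so the
    trailing version is left out of the captured coordinates.
    """
    pattern = re.escape(configuration) + r"""(?: |\()['"](.*):"""
    match = re.search(pattern, line)
    return match.group(1) if match else None

# Illustrative lines, not taken from the dataset.
print(dependency("    compile 'com.google.code.gson:gson:2.8.0'"))
print(dependency('    testCompile("junit:junit:4.12")', "testCompile"))
# The lowercase 'compile' pattern does not match inside 'testCompile'.
print(dependency("    testCompile 'junit:junit:4.12'"))
```

Dependencies declared with the map notation (group:, name:, version:) are not captured as coordinates by this pattern, so the counts are best read as a lower bound.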

Summary

And this wraps up our analysis of Gradle build files, thanks to Google BigQuery and the Github dataset. It's interesting to see that Gradle has gained a very significant market share, coming pretty close to the Maven incumbent, and to see lots of Android projects are on Github with Gradle builds.

Analyzing Gradle, Grails, and Apache Groovy source code hosted on Github with BigQuery

A few months ago, I wrote an article about what you can learn from millions of lines of Apache Groovy source code hosted on Github, thanks to Google BigQuery. We answered a few questions like:

  • How many Groovy files are there on Github?
  • What are the most popular Groovy file names?
  • How many lines of Groovy source code are there?
  • What's the distribution of size of source files?
  • What are the most frequent imported packages?
  • What are the most popular Groovy APIs used?
  • What are the most used AST transformations?
  • Do people use import aliases much?
  • Did developers adopt traits?

At G3 Summit this week, I gave a presentation on this source code analysis, but decided to expand it a little bit, by also adding queries about Grails and Gradle.

For Gradle, here are the questions that I answered:

  • How many Gradle build files are there?
  • How many Maven build files are there?
  • Which versions of Gradle are being used?
  • How many of those Gradle files are settings files?
  • What are the most frequent build file names?
  • What are the most frequent Gradle plugins?
  • What are the most frequent “compile” and “test” dependencies?

And for Grails, here's what I covered:

  • What are the most used SQL databases?
  • What are the most frequent controller names?
  • What are the repositories with the biggest number of controllers?
  • What is the distribution of the number of controllers?

I'll come back to those new queries in subsequent articles! But in the meantime, let me show you the slides I presented, and the results of those queries.


My G3 Summit Apache Groovy keynote

This week, I'm in Florida for the brand new G3 Summit conference, dedicated to the Apache Groovy ecosystem (Grails, Gradle, and more). I had the chance to give the keynote, in which I gave an overview of the Apache Groovy project's philosophy, history, and where it's heading. In the second part, I showcased the new features and syntax constructs already there or coming in Groovy 2.4.x, in the future Groovy 2.5, and in Groovy 3.0 with the new parser.


Billions of lines of code in a single repository, seriously?

When I joined Google last June, I discovered a new world: tons of new acronyms or project code names to learn about, but also a particular environment for your source code. At Google, engineers work on a huge monolithic source code repository comprising:
  • 1 billion files
  • 9 million source files
  • 2 billion lines of code
  • 35 million commits
  • 86 terabytes of content
  • 45 thousand commits every day

Rachel Potvin, who's an engineering manager at Google, wrote an article for ACM about how Google handles such a huge repository, as well as the tools and practices around that. Wired also covered the topic in their article "Google is 2 billion lines of code and it's all in one place". And Rachel also presented this topic at the @Scale conference.

I had the chance to give her presentation at Devoxx Belgium.

You can find the slide deck embedded below:

And the talk was also recorded, so you can view the video on Devoxx's YouTube channel here:


 
© 2012 Guillaume Laforge | The views and opinions expressed here are mine and don't reflect the ones from my employer.