
Google BigQuery Now Lets You Query All Open-Source Projects on GitHub

by Sergio De Simone on Jul 08, 2016

A full snapshot of more than 2.8 million open source projects hosted on GitHub is now available in Google’s BigQuery, Google and GitHub announced. This makes it possible to query almost 2 billion source files hosted on GitHub using SQL.

GitHub’s BigQuery dataset is based on the GitHub Archive Project, which takes snapshots of GitHub at specific points in time and stores them so they remain accessible for further analysis. Thanks to GitHub’s BigQuery dataset, the content of the GitHub Archive Project is now readily available through arbitrary SQL-like queries.

According to Arfon Smith, program manager for open source data at GitHub, the new BigQuery dataset could be used, for example, to find out which Go packages are most commonly used, or which US schools have the most open source contributors. He also says it can be useful to researchers studying open source communities or the latest trends in development.

Google developer advocate Felipe Hoffa adds a few more examples of possible uses, such as finding every project that uses a given open source library, or analyzing how that library is used in order to inform decisions about its future development.

In a post on Medium, Hoffa lists a few queries created by Google engineers and others to analyze Go programs and to find the most used Java imports, the top Angular directives, and the top Emacs packages.
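As a rough illustration of what a "most used Java imports" query involves, the Python sketch below holds one possible standard-SQL formulation as a string and mirrors its counting logic locally on in-memory sources. It is not one of the queries from Hoffa's post: the column names `content` and `sample_path`, the regular expression, and the helper `top_imports` are assumptions made for illustration, and the SQL text has not been checked against the live table.

```python
import re
from collections import Counter

# Hypothetical query text (standard SQL). The table name comes from the
# article; the column names `content` and `sample_path` are assumptions.
TOP_JAVA_IMPORTS_SQL = r"""
SELECT imported, COUNT(*) AS uses
FROM `bigquery-public-data.github_repos.sample_contents`,
     UNNEST(REGEXP_EXTRACT_ALL(content, r'(?m)^import\s+([\w.]+);')) AS imported
WHERE sample_path LIKE '%.java'
GROUP BY imported
ORDER BY uses DESC
LIMIT 10
"""

# Same pattern as in the SQL above. It matches plain `import a.b.C;` lines
# and deliberately ignores `import static ...` and wildcard imports.
IMPORT_RE = re.compile(r"^import\s+([\w.]+);", re.MULTILINE)


def top_imports(sources, n=10):
    """Count imported names across Java source strings, most frequent first."""
    counts = Counter()
    for source in sources:
        counts.update(IMPORT_RE.findall(source))
    return counts.most_common(n)


if __name__ == "__main__":
    samples = [
        "import java.util.List;\nimport java.io.File;\nclass A {}",
        "import java.util.List;\nclass B {}",
    ]
    print(top_imports(samples, n=2))
```

The local function is only there to make the grouping and counting step easy to follow; against the real data that work would be done by BigQuery itself.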

GitHub’s BigQuery dataset contains about 1.5TB of data and is automatically updated every hour. To get started with it:

Google provides 1 TB of data processed per month free of charge, but, as Google developer advocate Felipe Hoffa warns, a single query against the main table (bigquery-public-data:github_repos.contents) will consume the free terabyte. Instead, he suggests using the 23GB official extract (bigquery-public-data:github_repos.sample_contents) or any of the language-focused extracts that Google is making available for popular languages such as Go, Ruby, JavaScript, PHP, Python, and Java. BigQuery can also be used to create custom datasets, but in that case the user will be charged for their storage.
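The free-quota arithmetic behind that warning can be sketched in a few lines of Python. The sizes are the ones quoted in the article; treating 1 TB as 1000 GB and using the 1.5TB dataset figure as a stand-in for a full scan are simplifying assumptions. The second helper is not mentioned in the article: it relies on the dry-run mode of the `google-cloud-bigquery` client library to estimate a query's scan size before running it, and it assumes the library is installed and credentials are configured.

```python
# Figures quoted in the article; 1 TB is taken as 1000 GB for simplicity.
FREE_TIER_GB = 1000      # data processed per month free of charge
FULL_DATASET_GB = 1500   # approximate size of the whole dataset (stand-in)
SAMPLE_GB = 23           # official sample_contents extract


def full_scans_within_free_tier(scanned_gb, free_gb=FREE_TIER_GB):
    """Number of complete scans of `scanned_gb` that fit in the free quota."""
    return free_gb // scanned_gb


def estimate_scanned_bytes(sql):
    """Return the bytes a query would process, without executing it.

    Uses a BigQuery dry run. The import is local so that the arithmetic
    above still works when google-cloud-bigquery is not installed.
    """
    from google.cloud import bigquery

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed


if __name__ == "__main__":
    print(full_scans_within_free_tier(SAMPLE_GB))
    print(full_scans_within_free_tier(FULL_DATASET_GB))
```

With these figures, the 23GB extract can be scanned in full 43 times within the free terabyte, while a single pass over 1.5TB already exceeds it.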

Google BigQuery Public Datasets is a collection of datasets Google makes available through BigQuery under a special plan where users are charged only for the queries they perform, not for the dataset storage. Other datasets in the collection include USA names, Hacker News stories and comments since 2006, global climatology data between 1929 and 2016, and more.


How does this work? by Heng Cao

How does a query against a 1.5GB dataset use up the 1TB cap? Does BigQuery do multiple scans?

Re: How does this work? by Sergio De Simone

Thanks for pointing this out: the dataset is actually 1.5TB, not 1.5GB.
I fixed that in the post.
