Inject your Spring Boot app data into Google BigQuery

Nicolas Raynaud
7 min read · Oct 15, 2020


We all know data is king! But to be analyzed, data first needs to be extracted and centralized in a data warehouse. This tutorial covers a standard use case where you need to extract data from your Java Spring Boot application and load it into Google BigQuery datasets to be analyzed later.

The use case here is to build a generic set of services or components that can query data from a database and load it into a BigQuery table, either truncating or updating any previous data.

1. Generate a simple Spring Boot project

For this tutorial to work, I’ll assume you already know how to set up a proper Spring Boot app with a JPA datasource (whatever flavour you feel like…).

If you use Spring Initializr, make sure you choose the following dependencies:

  • Spring Data JPA
  • GCP Support (Google Cloud Platform)

2. Create a Google Cloud Platform service account

To use Google Cloud Platform from your project, you’ll need valid credentials. The proper way to set this up is to create a service account dedicated to your app and restrict its permissions to only the resources it’ll need.

Here is how to create a Google Cloud Platform service account:

1. In your Google Cloud Platform project, go to “IAM & Admin” / “Service Accounts”.

2. Click on “Create service account” and name it properly, so you’ll understand what it does just by looking at it.

3. Select the permissions that will later allow us to create/update tables in a BigQuery dataset (BigQuery Data Editor and BigQuery Job User).

4. Skip the next step (it is not required for this tutorial).

5. The service account is created; you now need to download its access key, which will later be used to configure your app. Click on “Create key”.

6. Select the “JSON” key type (this is the default) and click “Create”.

7. You’ll then be able to download the JSON file for this service account. Make sure you store this file in a secure environment, as it represents the credentials of the service account and will allow whoever has it to access your GCP account.

3. Create a BigQuery dataset

1. Create a BigQuery dataset that will hold your data. In “BigQuery”, select your project (here spring-boot-to-bigquery) and click “Create Dataset”.

2. Fill in a name and a location: the choice of the data location is important, as it cannot be changed later. Also, if you already have datasets and want to query across several of them, they need to be in the same location.

We now have a dataset named “spring_app”. We’ll use that name later on.

4. Set up the Spring Boot Google Cloud Platform integration

You should now have a project with a build.gradle that looks a lot like this:
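For reference, here is a sketch of what it might contain when generated from Spring Initializr at the time of writing (exact versions and entries are assumptions and depend on when you generated the project):

plugins {
    id 'org.springframework.boot' version '2.3.4.RELEASE'
    id 'io.spring.dependency-management' version '1.0.10.RELEASE'
    id 'java'
}

group = 'com.example'
version = '0.0.1-SNAPSHOT'
sourceCompatibility = '11'

repositories {
    mavenCentral()
}

ext {
    set('springCloudGcpVersion', '1.2.5.RELEASE')
}

dependencies {
    implementation 'org.springframework.boot:spring-boot-starter-data-jpa'
    implementation 'org.springframework.cloud:spring-cloud-gcp-starter'
    runtimeOnly 'com.h2database:h2' // or whatever JDBC driver your datasource uses
    testImplementation 'org.springframework.boot:spring-boot-starter-test'
}

dependencyManagement {
    imports {
        mavenBom "org.springframework.cloud:spring-cloud-gcp-dependencies:${springCloudGcpVersion}"
    }
}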

Well, you’ll need to add another dependency to tell your project that you are going to integrate with BigQuery:

// Integrate with BigQuery
implementation 'org.springframework.cloud:spring-cloud-gcp-starter-bigquery'

And another one to write CSV files more easily:

// Write CSV Files
implementation 'org.apache.commons:commons-csv:1.8'

Now, remember the GCP service account JSON file you downloaded? You are going to need it now.

To let Spring know which credentials to use when accessing GCP services, you have several choices (here is the Spring documentation, or the short version below):

  • Specify an environment variable with the location of the file:
GOOGLE_APPLICATION_CREDENTIALS=/path/to/file.json
  • Embed the JSON file with your project, and reference it using configuration (assuming you are using application.properties):
spring.cloud.gcp.credentials.location=file:/usr/local/key.json
  • Or specify the file content directly in Base64 (assuming you are using application.properties):
spring.cloud.gcp.credentials.encoded-key=ewogICJ0eXBlIjogInNlcnZpY2VfYWNjb3VudCIsCiAgInByb2plY3RfaWQiOiAiaGFyZGJhY29uLTI0MzMyMCIsCiAgInByaXZhdGVfa2V5X2lkIjogIjViZmY5YzQ2NGFlOGIxZjc1ZmU0MTBmNzhkZTY1ZDBmM2M4M2YxZDki....

The last way is my favorite, since it allows for a quick deployment without any additional file to deploy with your server. Of course, depending on your production environment and your security requirements, you might prefer one of the former ways.

By the way, should you choose the last option, here is a command to encode your file in Base64:

$ cat spring-boot-to-bigquery-a125027b0ed6.json | base64
ewogICJ0eXBlIjogInNlcnZpY2VfYWNjb3VudCIsCiAgInByb2plY3RfaWQiOiAic3ByaW5nLWJvb3QtdG8tYmlncXVlcnkiLAogICJwcml2YXRlX2tleV9pZCI6ICJhMTI1MDI3YjBlZDZiMTBmOD…

Finally, you also need to specify which GCP project your app will be connecting to, and the dataset you’ll be writing data to (the one we created earlier), so your application.properties now looks like this:

spring.cloud.gcp.project-id=spring-boot-to-bigquery
spring.cloud.gcp.credentials.encoded-key=ewogICJ0eXBlIjogInNlcnZpY2VfYWNjb3VudCIsCiAgInByb2plY3RfaWQiOiAic3ByaW5nLWJvb3QtdG8tYmlncXVlcnkiLAogICJwcml2YXRlX2tleV9pZCI6ICJhMTI1MDI3YjBlZDZiMTBmODA3MTUy...
spring.cloud.gcp.bigquery.dataset-name=spring_app

5. The Scenario

Here is our data model: we have users. Each of them has one device and may have subscriptions.

The stat we want to extract is the number of subscriptions per operating system, to answer the mighty question: are iOS users buying more than Android users?

This data is better stored and analyzed in BigQuery rather than directly on our production database. So what we need to do is extract enough data into BigQuery and do our reporting from there.

Obviously, we don’t need the whole DB to compute our statistic. Here I think we’ll just need to extract the subscriptions, the date they were bought and the OS of the device associated with the user making the transaction. So our target data schema will be the following (note that we are extracting the date so that we can track the evolution of subscriptions over time):

subscriptionId (NUMERIC)
dateBought (DATE)
deviceOS (STRING)
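In the code that follows, I’ll pass this schema around as a simple list of column name/type pairs. Here is a minimal sketch of that representation (using Spring Data’s Pair class, which is an assumption of mine):

import java.util.Arrays;
import java.util.List;

import org.springframework.data.util.Pair;

public class SubscriptionSchema {

    // Column name / type pairs describing the target BigQuery table,
    // matching the schema listed above.
    public static final List<Pair<String, String>> SCHEMA = Arrays.asList(
            Pair.of("subscriptionId", "NUMERIC"),
            Pair.of("dateBought", "DATE"),
            Pair.of("deviceOS", "STRING"));
}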

6. How to do it

By default, in a free-tier GCP account, the only way to inject data into BigQuery is through file loading. Inserting data as objects, one by one, through the API is only available on paid tiers.
So, our example will make use of CSV files to inject data in bulk into data tables.

Basically we will:

  • Query for data using a JdbcTemplate
  • Write the results into a CSV file
  • Load the CSV file into BigQuery

7. The SQL query and creation of a CSV file

This is a basic process, but I’ll give you my code just in case it can help someone going through the same steps…

Here is the query we need to use to extract sufficient data:

SELECT s.id as subscriptionId, s.dateBought, d.OS as deviceOS 
FROM subscription s INNER JOIN device d ON d.user_id = s.user_id;

So now, let’s create a service that takes that query and a schema, and produces a CSV file containing the data:
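Here is a minimal sketch of what that service can look like (the class and method names are my own; the CsvResultSetExtractor it relies on is described right after):

import java.io.File;
import java.util.List;

import org.springframework.data.util.Pair;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;

@Service
public class CsvExportService {

    private final JdbcTemplate jdbcTemplate;

    public CsvExportService(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Runs the given SQL query and writes the result set into a CSV file,
    // formatted according to the provided (column name, type) schema.
    public File exportToCsv(String sql, List<Pair<String, String>> schema) {
        return jdbcTemplate.query(sql, new CsvResultSetExtractor(schema));
    }
}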

Of course, all the magic is in the `CsvResultSetExtractor` :)

This class takes a JdbcTemplate ResultSet and converts it into a flat CSV file using the Apache Commons CSV library. It uses a schema provided as a list of Pairs containing the column names and their datatypes. I have chosen to use standard SQL datatype names such as “NUMERIC”, “DATE” and “STRING”. You could extend it to any type you like, and probably use enums and better error handling, but that’s not the point of this post!
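Here is a sketch of that extractor based on the description above (the temporary file handling, Spring Data’s Pair and the error handling are my own choices):

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVPrinter;
import org.springframework.dao.DataAccessException;
import org.springframework.data.util.Pair;
import org.springframework.jdbc.core.ResultSetExtractor;

// Flattens a JDBC ResultSet into a CSV file, using a schema given as
// (column name, type) pairs where type is NUMERIC, DATE or STRING.
public class CsvResultSetExtractor implements ResultSetExtractor<File> {

    private final List<Pair<String, String>> schema;

    public CsvResultSetExtractor(List<Pair<String, String>> schema) {
        this.schema = schema;
    }

    @Override
    public File extractData(ResultSet rs) throws SQLException, DataAccessException {
        try {
            File csvFile = File.createTempFile("bigquery-export-", ".csv");
            String[] header = schema.stream().map(Pair::getFirst).toArray(String[]::new);
            try (CSVPrinter printer = new CSVPrinter(new FileWriter(csvFile),
                    CSVFormat.DEFAULT.withHeader(header))) {
                while (rs.next()) {
                    for (Pair<String, String> column : schema) {
                        printer.print(readColumn(rs, column));
                    }
                    printer.println();
                }
            }
            return csvFile;
        } catch (IOException e) {
            throw new IllegalStateException("Unable to write CSV file", e);
        }
    }

    // Reads a single column from the current row according to its declared type.
    private Object readColumn(ResultSet rs, Pair<String, String> column) throws SQLException {
        switch (column.getSecond()) {
            case "NUMERIC":
                return rs.getBigDecimal(column.getFirst());
            case "DATE":
                return rs.getDate(column.getFirst());
            case "STRING":
                return rs.getString(column.getFirst());
            default:
                throw new IllegalArgumentException("Unsupported type: " + column.getSecond());
        }
    }
}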

So now, we are able to extract the results of any query from the DB into a CSV file.

8. Implement a BigQuery Connector

We will create a service that accepts a CSV file and a schema as input, and uploads it to our BigQuery dataset, creating the data table matching the schema if it does not exist.

This is pretty straightforward once you know which settings to use!
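Here is a sketch of such a service, using the BigQuery client bean auto-configured by the Spring Cloud GCP starter (the class name, the type mapping and the hard-coded WRITE_TRUNCATE disposition are my own choices):

import java.io.File;
import java.io.OutputStream;
import java.nio.channels.Channels;
import java.nio.file.Files;
import java.util.List;

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.CsvOptions;
import com.google.cloud.bigquery.Field;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardSQLTypeName;
import com.google.cloud.bigquery.TableDataWriteChannel;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.WriteChannelConfiguration;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.data.util.Pair;
import org.springframework.stereotype.Service;

@Service
public class BigQueryUploadService {

    private final BigQuery bigQuery;
    private final String datasetName;

    public BigQueryUploadService(BigQuery bigQuery,
            @Value("${spring.cloud.gcp.bigquery.dataset-name}") String datasetName) {
        this.bigQuery = bigQuery;
        this.datasetName = datasetName;
    }

    // Loads a CSV file into the given table, creating it from the schema
    // if needed and truncating any previously loaded data.
    public void uploadCsv(String tableName, File csvFile, List<Pair<String, String>> schema)
            throws Exception {
        WriteChannelConfiguration config = WriteChannelConfiguration
                .newBuilder(TableId.of(datasetName, tableName))
                // our CSV files contain a header row, so skip it when loading
                .setFormatOptions(CsvOptions.newBuilder().setSkipLeadingRows(1).build())
                .setSchema(toBigQuerySchema(schema))
                .setCreateDisposition(JobInfo.CreateDisposition.CREATE_IF_NEEDED)
                .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)
                .build();

        TableDataWriteChannel writer = bigQuery.writer(config);
        try (OutputStream out = Channels.newOutputStream(writer)) {
            Files.copy(csvFile.toPath(), out);
        }
        // the load runs as an asynchronous BigQuery job: wait for it to finish
        Job job = writer.getJob().waitFor();
        if (job.getStatus().getError() != null) {
            throw new IllegalStateException("BigQuery load failed: " + job.getStatus().getError());
        }
    }

    // Maps our simple (name, type) pairs to a BigQuery schema.
    private Schema toBigQuerySchema(List<Pair<String, String>> schema) {
        return Schema.of(schema.stream()
                .map(c -> Field.of(c.getFirst(), StandardSQLTypeName.valueOf(c.getSecond())))
                .toArray(Field[]::new));
    }
}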

And that’s it!

Now we can chain those two calls and create a BigQuery data table from an SQL query.
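For instance, here is a small component wiring the two sketches above together (all names are hypothetical):

import java.io.File;

import org.springframework.stereotype.Component;

@Component
public class SubscriptionExportJob {

    private static final String QUERY =
            "SELECT s.id as subscriptionId, s.dateBought, d.OS as deviceOS "
          + "FROM subscription s INNER JOIN device d ON d.user_id = s.user_id";

    private final CsvExportService csvExportService;
    private final BigQueryUploadService bigQueryUploadService;

    public SubscriptionExportJob(CsvExportService csvExportService,
            BigQueryUploadService bigQueryUploadService) {
        this.csvExportService = csvExportService;
        this.bigQueryUploadService = bigQueryUploadService;
    }

    // Extracts the subscription data to CSV and loads it into a "subscriptions" table.
    public void run() throws Exception {
        File csv = csvExportService.exportToCsv(QUERY, SubscriptionSchema.SCHEMA);
        bigQueryUploadService.uploadCsv("subscriptions", csv, SubscriptionSchema.SCHEMA);
    }
}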

Of course, you can use either part of this process on its own, to produce custom CSV files or to load data into BigQuery from any other source, considering it takes plain old CSV files as input!

9. See the result!

After running the service, you will see the new table in your BigQuery dataset.
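From there, answering the original question is a simple aggregation you can run in the BigQuery console (assuming the data was loaded into a table named subscriptions in the spring_app dataset):

SELECT deviceOS, COUNT(*) AS subscriptions
FROM `spring_app.subscriptions`
GROUP BY deviceOS
ORDER BY subscriptions DESC;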

10. Conclusions and going forward

While it might not be the most beautiful and neat way to extract data from your app to BigQuery, it lays out the basics of how to do so without much pain.

Once these services are set up, why not use Spring Batch to schedule such an export? You could then schedule these data updates to keep the data in your BigQuery dataset up to date.

You could also date your SQL queries (add a date parameter) and choose not to truncate the BigQuery data tables when uploading: this would certainly improve the speed and efficiency of the whole process, since the current code will probably suffer from memory issues once your data reaches a certain size.

You can find the code in my GitHub repo here: https://github.com/nicoraynaud/data-to-bigquery. Feel free to comment or let me know how I can improve or correct this to better help anyone facing the same task!


Nicolas Raynaud

I am a curious developer working hard to follow Software Craftsmanship principles.