Import Google Cloud Storage to Google BigQuery


In one of my projects, I use Google BigQuery to select and process about 3 TB of data. I first tried IBM Netezza (https://www-01.ibm.com/software/data/netezza/), but its performance was not satisfactory. I then uploaded all my data to Google Cloud Storage using gsutil (https://cloud.google.com/storage/docs/gsutil), which took 5 hours for the 3 TB. Next, I needed to import the data from all the text files into BigQuery, for which I used the following code (a full version is available here: https://github.com/hellolvn1/GoogleBigQueryUpload). Here are the steps:

  • Create a job:

Job job = new Job();
JobConfiguration config = new JobConfiguration();
JobConfigurationLoad loadConfig = new JobConfigurationLoad();
config.setLoad(loadConfig);
job.setConfiguration(config);

  • Load the files from Google Cloud Storage:

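// Assumption: sources is a list of gs:// URIs, one entry per input file
List<String> sources = new ArrayList<String>();
sources.add("gs://tensorflowredditapp/" + year + "/RC_" + year + "-" + monthStr + "-comments-grams" + gram);
loadConfig.setSourceUris(sources);
loadConfig.setIgnoreUnknownValues(false); // rows with extra columns count as bad records
loadConfig.setAllowJaggedRows(false);     // rows missing trailing columns count as bad records
loadConfig.setFieldDelimiter(",");
loadConfig.setMaxBadRecords(99999999);    // number of bad records tolerated before the job fails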

  • Set up a new table

TableReference tableRef = new TableReference();
tableRef.setDatasetId("RedditGram");
tableRef.setTableId("GRAM_" + year + "_" + monthStr + "_" + gram);
tableRef.setProjectId(projectId);
loadConfig.setDestinationTable(tableRef);

List<TableFieldSchema> fields = new ArrayList<TableFieldSchema>();
TableFieldSchema fieldFoo = new TableFieldSchema();
fieldFoo.setName("WORD");
fieldFoo.setType("string");
TableFieldSchema fieldBar = new TableFieldSchema();
fieldBar.setName("COUNT");
fieldBar.setType("integer");
fields.add(fieldFoo);
fields.add(fieldBar);
TableSchema schema = new TableSchema();
schema.setFields(fields);
loadConfig.setSchema(schema);
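
The snippets above use year, monthStr and gram without showing where they come from. Here is a minimal sketch of how the source URI and the table ID could be built for every month of a year. The class GramNames and its methods are hypothetical names of mine, and I am assuming that monthStr is the month zero-padded to two digits and that year and gram are integers; the full version on GitHub may do this differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of the BigQuery client library).
final class GramNames {
    // Assumption: monthStr is the month zero-padded to two digits, e.g. "01".
    static String monthStr(int month) {
        return (month < 10 ? "0" : "") + month;
    }

    // Same concatenation as in the "Load the files" step.
    static String sourceUri(int year, int month, int gram) {
        return "gs://tensorflowredditapp/" + year + "/RC_" + year + "-"
                + monthStr(month) + "-comments-grams" + gram;
    }

    // Same concatenation as in the "Set up a new table" step.
    static String tableId(int year, int month, int gram) {
        return "GRAM_" + year + "_" + monthStr(month) + "_" + gram;
    }

    // One source URI per month of the given year.
    static List<String> sourceUrisForYear(int year, int gram) {
        List<String> uris = new ArrayList<String>();
        for (int month = 1; month <= 12; month++) {
            uris.add(sourceUri(year, month, gram));
        }
        return uris;
    }

    public static void main(String[] args) {
        // Print the file-to-table mapping for one year of 1-grams.
        for (int month = 1; month <= 12; month++) {
            System.out.println(sourceUri(2015, month, 1)
                    + " -> RedditGram." + tableId(2015, month, 1));
        }
    }
}
```

With this layout, each monthly file is loaded into its own table, so one load job is created per (year, month, gram) combination.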

  • Execute the job

// Submit the load job; this only starts it, BigQuery runs it asynchronously
Insert insert = bigquery.jobs().insert(projectId, job);
insert.setProjectId(projectId);
JobReference jobRef = insert.execute().getJobReference();

System.out.format("\nJob ID of load job is: %s\n", jobRef.getJobId());
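
Inserting the job only submits it; the load itself runs in the background, so the program usually has to poll until the job reports the state DONE. Below is a minimal sketch of such a polling loop. JobPoller and waitUntilDone are hypothetical names of mine, and the state is fetched through a plain Supplier so that the sketch runs without the BigQuery library.

```java
import java.util.function.Supplier;

// Hypothetical helper (not part of the BigQuery client library).
final class JobPoller {
    // Calls stateFetcher until it returns "DONE" or maxAttempts is reached.
    // Returns true if the job reached DONE, false on timeout or interruption.
    static boolean waitUntilDone(Supplier<String> stateFetcher,
                                 int maxAttempts, long sleepMillis) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if ("DONE".equals(stateFetcher.get())) {
                return true;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop polling.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Fake job that is RUNNING twice and then DONE.
        final int[] calls = {0};
        boolean done = waitUntilDone(
                () -> ++calls[0] < 3 ? "RUNNING" : "DONE", 10, 100L);
        System.out.println("done=" + done + " after " + calls[0] + " polls");
    }
}
```

With the client used in this post, the supplier would presumably wrap bigquery.jobs().get(projectId, jobRef.getJobId()).execute().getStatus().getState() and convert the IOException thrown by execute() into an unchecked exception. Note that DONE does not mean success: once the job is done, getStatus().getErrorResult() should also be checked, and it is null when the load succeeded.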

 


anh
