Get your data into BigQuery
1. Decide whether to use batch loading or streaming for your data.
Batch loading: load your data periodically, for example once per day or once per week. Use cases: loading the contacts from your website once per day for reporting and analysis, or loading the sales from your CRM once per month to create monthly reports. Streaming: send your data to the BigQuery project continuously, every time new data is available; choose this when you need real-time monitoring. Use cases: fraud detection, or mobile application usage when you want to detect errors and bugs immediately.
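To make the difference concrete, here is a minimal streaming sketch with the BigQuery Python client library. The project, dataset, table, and field names are hypothetical placeholders, and the destination table is assumed to already exist; batch loading is shown in the steps that follow.

from google.cloud import bigquery

client = bigquery.Client()
# Hypothetical destination table; replace with your own project.dataset.table.
table_id = "your-project.your_dataset.app_events"

# Streaming: push rows to BigQuery as soon as they are available.
rows = [
    {"event": "error", "device": "android", "ts": "2020-01-01T12:00:00"},
]
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Streaming insert errors:", errors)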
2. Choose a Google service, if applicable, to help get your data into BigQuery.
BigQuery Data Transfer Service: connects to Google sources such as Google Play, Google Ads ($2.50 per month per customer ID), and YouTube, as well as some non-Google tools such as Amazon S3. Google Analytics 360/Google Analytics 4: digital analytics platforms that connect natively to BigQuery. Cloud Storage: send your CSV files there and then retrieve them in BigQuery. Google Cloud Dataflow: build data pipelines for streaming (with Cloud Pub/Sub) and batch loading. Google Cloud Functions: let you react to an event with an automatic action; for example, you can trigger a function when a CSV file is added to your Cloud Storage bucket and send its data to your BigQuery table. Google Cloud Composer: helps you author, schedule, and monitor pipelines. Google Data Fusion: a data integration service for moving your data from some Google and non-Google services into BigQuery. You can also use the Cloud APIs and client libraries to connect your data source to Cloud Storage and BigQuery.
3. Load your file manually into BigQuery by creating a table and uploading it from your computer.
Click on the dataset where you want to create your table, then click on Create Table (+). Browse for the file on your computer and give a name to your table. Check Auto detect under Schema and input parameters so that BigQuery will detect your field data types. Click on Create table. You will now see your table available in your dataset.
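If you prefer to script this manual upload rather than click through the console, a minimal sketch with the Python client library looks like the following; the table ID and the contacts.csv file name are hypothetical.

from google.cloud import bigquery

client = bigquery.Client()
# Hypothetical destination; replace with your own project.dataset.table.
table_id = "your-project.test_dataset.contacts"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the CSV header row
    autodetect=True,      # let BigQuery detect the field data types
)

with open("contacts.csv", "rb") as source_file:
    job = client.load_table_from_file(source_file, table_id, job_config=job_config)
job.result()  # wait for the load job to finish
print(client.get_table(table_id).num_rows, "rows loaded")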
4. Connect Google Analytics 4 to your BigQuery project by clicking on BigQuery Linking > Link at the property level.
Choose your destination project and data location. Choose the frequency of the data export: daily or streaming. You will need streaming only if you have to analyze the analytics data in real time, which is quite rare. Review and submit. You will soon see a new dataset created in your BigQuery project; it will be called analytics_XXXXXXX, where XXXXXXX is your property number.
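Once the export has run, you can check that the data is arriving with a quick query. A sketch with the Python client library, using a hypothetical project name and keeping XXXXXXX as the placeholder for your property number:

from google.cloud import bigquery

client = bigquery.Client()

# Count events per event_name across the daily events_* tables that
# the GA4 export creates in the analytics_XXXXXXX dataset.
query = """
    SELECT event_name, COUNT(*) AS events
    FROM `your-project.analytics_XXXXXXX.events_*`
    GROUP BY event_name
    ORDER BY events DESC
"""
for row in client.query(query).result():
    print(row.event_name, row.events)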
5. Get your analytics or advertising data into BigQuery using third-party connectors like Owox, Supermetrics, Funnel IO, and StitchData.
Third-party connectors let you connect well-known and widely used data sources, like ad platforms, CRMs, and call tracking solutions, to your BigQuery project. Here we’ll use the Owox connector to connect a Facebook advertising account to your BigQuery project. In your Owox account, click on New Pipeline. Select Facebook Ads. Select BigQuery as the data destination. Provide access to your Facebook account and to BigQuery. Select the BigQuery dataset that will receive the data, or create this dataset. Choose the start date for your import. You will soon see the Facebook data in your dataset in BigQuery.
6. Use DML/DDL to generate data as one of the methods to create tables in your BigQuery Data Warehouse.
You can query existing data to create your new table. For example, query the public data:

SELECT payment_type, COUNT(unique_key) AS nb_trips
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
GROUP BY payment_type
ORDER BY nb_trips DESC

And then add a line at the top of this query to create a new table payment_types in the test_dataset dataset in your fast-fire-262114 project:

CREATE TABLE `fast-fire-262114.test_dataset.payment_types` AS
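You can run this statement in the BigQuery console, or from the Python client library. A minimal sketch, reusing the project and dataset names from the example above:

from google.cloud import bigquery

client = bigquery.Client()

ddl = """
    CREATE TABLE `fast-fire-262114.test_dataset.payment_types` AS
    SELECT payment_type, COUNT(unique_key) AS nb_trips
    FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
    GROUP BY payment_type
    ORDER BY nb_trips DESC
"""
client.query(ddl).result()  # runs the statement and waits for it to finish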
7. If you have a custom data source that doesn't connect with BigQuery easily, but that can generate CSV files, you can use Cloud Storage and a Cloud Function to get your data into BigQuery.
Ask your developers to send you the CSV files in batch mode, once per day, to your Cloud Storage bucket. They can use the Google Cloud APIs and client libraries to connect to Google Cloud Storage, as in the sketch below. Create the Cloud Storage bucket that will receive these files and give your developers its name. To create a bucket: go to Storage, click on Create Bucket, and give it a unique name. Choose where to store your data; you’ll need a Region location type and the location closest to you. Choose a Standard storage class and fine-grained access control. Leave other settings unchanged. Click on Create. Your bucket is ready to receive the information, and your developers can start sending the CSV files to it.
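Here is a minimal upload sketch your developers could start from, using the google-cloud-storage client library; the bucket name and file names are hypothetical.

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("your-bucket")

# Push the local daily export to the bucket; the Cloud Function set up in
# the next steps will pick it up from there.
blob = bucket.blob("2020-01-15-001-contacts-export.csv")
blob.upload_from_filename("contacts-export.csv")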
8. Create a Cloud Function that will load the files into a dataset and a table, letting BigQuery detect the schema automatically.
Go to Cloud Functions. Click Create and give your function an explicit name. Choose a Region that corresponds to your Storage region, choose Cloud Storage as the Trigger type and Finalize/Create as the Event type, then save and continue. Choose Python 3.7 as the runtime and, in requirements.txt, add the following dependencies:

google-cloud-bigquery
google-cloud-storage
9. In main.py, add the code that imports a CSV file into BigQuery.
Change the following elements to your own values: your_project to your project ID; your-bucket to your own bucket name; your-file-name.*.csv to the name of the file that will be loaded to Cloud Storage (you can use a regular expression if your file name changes, for example 2020-\d{2}-\d{2}-\d+-contacts.*\.csv); your_dataset and your_table to your dataset and table names.

"""Import a csv file into BigQuery."""
import logging
import re

from google.cloud import bigquery
from google.cloud import storage

GCP_PROJECT = 'your_project'


def bigqueryImport(event, context):
    """Import a csv file into BigQuery."""
    # get the name of the file that triggered the function
    bucketname = 'your-bucket'
    filename = event['name']
    if not re.search(r'your-file-name.*\.csv', filename):
        logging.error('Unrecognized filename format: %s' % (filename))
        return

    # build the destination table id
    datasetname = 'your_dataset'
    table_id = '%s.%s.%s' % (GCP_PROJECT, datasetname, 'your_table')

    # the Cloud Storage URI of the received file
    uri = 'gs://%s/%s' % (bucketname, filename)

    # create a BigQuery client
    client = bigquery.Client()

    # check if the dataset exists, otherwise create it
    try:
        client.get_dataset(datasetname)
    except Exception:
        logging.warning('Creating dataset: %s' % (datasetname))
        client.create_dataset(datasetname)

    # create a BigQuery load job config with schema autodetection
    job_config = bigquery.LoadJobConfig()
    job_config.autodetect = True
    job_config.create_disposition = 'CREATE_IF_NEEDED'
    job_config.write_disposition = 'WRITE_APPEND'

    # get a handle on the file so it can be deleted after the load
    storage_client = storage.Client()
    bucket = storage_client.get_bucket(bucketname)
    file = bucket.blob(filename)

    # create and run the BigQuery load job
    try:
        load_job = client.load_table_from_uri(
            uri,
            table_id,
            job_config=job_config,
        )
        print('Load job: %s [%s]' % (load_job.job_id, table_id))
        load_job.result()  # wait for the load to complete
        file.delete()      # remove the file from Cloud Storage
    except Exception as e:
        logging.error('Failed to create load job: %s' % (e))

Click on Deploy to deploy your function.