BigQuery Storage API

Read from and write to BigQuery tables using the Storage API. Higher throughput than the query API for large datasets.

Storage Read

Reads table data directly via the BigQuery Storage Read API. Returns Arrow RecordBatch.

- gcp_bigquery_storage_read:
    name: read_accounts
    credentials_path: /etc/gcp/service-account.json
    project_id: my-project
    dataset_id: salesforce
    table_id: accounts
    selected_fields:
      - id
      - name
      - industry
    row_restriction: "industry = 'Technology'"

Read fields

FieldTypeDefaultDescription
namestringrequiredTask name.
credentials_pathstringGCP service account credentials. Falls back to Application Default Credentials when omitted.
project_idstringrequiredGCP project ID.
dataset_idstringrequiredBigQuery dataset.
table_idstringrequiredBigQuery table.
selected_fieldslistColumns to read (all if omitted).
row_restrictionstringWHERE clause for filtering rows.
sample_percentagefloatRandom sampling percentage.
snapshot_timestringTime-travel query timestamp (RFC 3339).
max_stream_countintMax parallel read streams.
data_formatstringarrowResult format: arrow or avro.
depends_onlistUpstream task names.
retryobjectRetry configuration.

Storage Write

Streams data into BigQuery tables via the Storage Write API. Accepts Arrow RecordBatch input.

- gcp_bigquery_storage_write:
    name: write_accounts
    credentials_path: /etc/gcp/service-account.json
    project_id: my-project
    dataset_id: salesforce
    table_id: accounts

Write fields

FieldTypeDefaultDescription
namestringrequiredTask name.
credentials_pathstringGCP service account credentials. Falls back to Application Default Credentials when omitted.
project_idstringrequiredGCP project ID.
dataset_idstringrequiredBigQuery dataset.
table_idstringrequiredBigQuery table.
change_typestringCDC change type: upsert or delete.
depends_onlistUpstream task names.
retryobjectRetry configuration.

Output

Read

FormatCrateDescription
Arrow RecordBatchgoogle-cloud-bigqueryTable data via the BigQuery Storage Read API. Emits one event per batch. Empty tables emit an empty batch with the target schema.

Write

FieldTypeDescription
rows_writtenintNumber of rows appended.