airflow.contrib.hooks.gcs_hook

Module Contents

class airflow.contrib.hooks.gcs_hook.GoogleCloudStorageHook(google_cloud_storage_conn_id='google_cloud_default', delegate_to=None)

Bases: airflow.contrib.hooks.gcp_api_base_hook.GoogleCloudBaseHook

Interact with Google Cloud Storage. This hook uses the Google Cloud Platform connection.

get_conn(self)

Returns a Google Cloud Storage service object.

copy(self, source_bucket, source_object, destination_bucket=None, destination_object=None)

Copies an object from one bucket to another, renaming it if requested.

Either destination_bucket or destination_object can be omitted, in which case the corresponding source value is used, but they cannot both be omitted.

Parameters
  • source_bucket (str) – The bucket of the object to copy from.

  • source_object (str) – The object to copy.

  • destination_bucket (str) – The bucket to copy the object to. Can be omitted; then the same bucket is used.

  • destination_object (str) – The (renamed) path of the object if given. Can be omitted; then the same name is used.
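
As a minimal sketch of the renaming case, the helper below copies an object to a new name in the same bucket by omitting destination_bucket. The helper name archive_object and the archive/ prefix are illustrative assumptions, and hook is expected to be a GoogleCloudStorageHook instance (any object exposing the same copy method would work).

```python
def archive_object(hook, bucket, object_name, prefix="archive/"):
    """Copy an object to <prefix><object_name> in the same bucket.

    hook is assumed to be a GoogleCloudStorageHook; prefix is a
    hypothetical naming convention, not part of the hook's API.
    """
    destination = prefix + object_name
    # destination_bucket is omitted, so the source bucket is reused.
    hook.copy(
        source_bucket=bucket,
        source_object=object_name,
        destination_object=destination,
    )
    return destination
```

Because only destination_object is supplied, the call stays within the rule that at most one of the two destination arguments may be left out.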

rewrite(self, source_bucket, source_object, destination_bucket, destination_object=None)

Has the same functionality as copy, except that it also works on files over 5 TB, as well as when copying between locations and/or storage classes.

destination_object can be omitted, in which case source_object is used.

Parameters
  • source_bucket (str) – The bucket of the object to copy from.

  • source_object (str) – The object to copy.

  • destination_bucket (str) – The bucket to copy the object to.

  • destination_object (str) – The (renamed) path of the object if given. Can be omitted; then the same name is used.
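
The following sketch shows one way rewrite might be used to mirror several objects into another bucket while keeping their names. The function name mirror_objects is hypothetical, and hook is assumed to be a GoogleCloudStorageHook instance.

```python
def mirror_objects(hook, source_bucket, object_names, destination_bucket):
    """Rewrite each named object into destination_bucket under the same name.

    hook is assumed to be a GoogleCloudStorageHook.
    """
    mirrored = []
    for name in object_names:
        # destination_object is omitted, so the source name is kept.
        hook.rewrite(source_bucket, name, destination_bucket)
        mirrored.append(name)
    return mirrored
```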

download(self, bucket, object, filename=None)

Get a file from Google Cloud Storage.

Parameters
  • bucket (str) – The bucket to fetch from.

  • object (str) – The object to fetch.

  • filename (str) – If set, a local file path where the file should be written to.
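
A minimal sketch of downloading to a local path, assuming hook is a GoogleCloudStorageHook instance. The helper name download_to_tempdir is hypothetical, and the local file name is simply taken from the last path segment of the object name.

```python
import os
import tempfile


def download_to_tempdir(hook, bucket, object_name):
    """Download an object into a fresh temporary directory.

    Returns the local path that was passed as filename.
    """
    tmp_dir = tempfile.mkdtemp()
    local_path = os.path.join(tmp_dir, os.path.basename(object_name))
    # filename is set, so the hook is asked to write the file to local_path.
    hook.download(bucket=bucket, object=object_name, filename=local_path)
    return local_path
```

The caller is responsible for removing the temporary directory afterwards.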

upload(self, bucket, object, filename, mime_type='application/octet-stream', gzip=False, multipart=False, num_retries=0)

Uploads a local file to Google Cloud Storage.

Parameters
  • bucket (str) – The bucket to upload to.

  • object (str) – The object name to set when uploading the local file.

  • filename (str) – The local file path to the file to be uploaded.

  • mime_type (str) – The MIME type to set when uploading the file.

  • gzip (bool) – Whether to gzip-compress the file before uploading it.

  • multipart (bool or int) – If True, the upload will be split into multiple HTTP requests. The default size is 256MiB per request. Pass a number instead of True to specify the request size, which must be a multiple of 262144 (256KiB).

  • num_retries (int) – The number of times to attempt to re-upload the file (or individual chunks, in the case of multipart uploads). Retries are attempted with exponential backoff.
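
Since a numeric multipart value must be a multiple of 262144 bytes, a small helper can round a desired request size up to the nearest valid value. This is only a sketch; the names CHUNK_UNIT and multipart_chunk_size are hypothetical and not part of the hook.

```python
CHUNK_UNIT = 262144  # 256 KiB, the granularity required for multipart sizes


def multipart_chunk_size(requested_bytes):
    """Round requested_bytes up to the nearest positive multiple of 256 KiB."""
    if requested_bytes <= 0:
        raise ValueError("requested_bytes must be positive")
    units = -(-requested_bytes // CHUNK_UNIT)  # ceiling division
    return units * CHUNK_UNIT
```

The result could then be passed as the multipart argument, for example hook.upload(bucket, object, filename, multipart=multipart_chunk_size(10 * 1024 * 1024)).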

exists(self, bucket, object)

Checks for the existence of a file in Google Cloud Storage.

Parameters
  • bucket (str) – The Google cloud storage bucket where the object is.

  • object (str) – The name of the object to check in the Google cloud storage bucket.
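
One possible use of exists is to skip an upload when the object is already present. The sketch below assumes hook is a GoogleCloudStorageHook instance; upload_if_missing is a hypothetical helper name.

```python
def upload_if_missing(hook, bucket, object_name, filename):
    """Upload filename unless the object already exists.

    Returns True if an upload was attempted, False if it was skipped.
    """
    if hook.exists(bucket, object_name):
        return False
    hook.upload(bucket, object_name, filename)
    return True
```

The check and the upload are two separate requests, so another writer could create the object in between; this pattern is a convenience, not a lock.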

is_updated_after(self, bucket, object, ts)

Checks whether an object in Google Cloud Storage was updated after the given timestamp.

Parameters
  • bucket (str) – The Google cloud storage bucket where the object is.

  • object (str) – The name of the object to check in the Google cloud storage bucket.

  • ts (datetime.datetime) – The timestamp to check against.
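
As a sketch of building the ts argument, the helper below asks whether an object changed within the last N hours. It assumes hook is a GoogleCloudStorageHook instance; changed_within is a hypothetical name, and using a timezone-aware UTC datetime is a choice made here to avoid ambiguity, not a documented requirement.

```python
from datetime import datetime, timedelta, timezone


def changed_within(hook, bucket, object_name, hours=24):
    """Return the hook's answer to: was the object updated in the last `hours` hours?"""
    # Timezone-aware cutoff in UTC, so the comparison does not depend on local time.
    cutoff = datetime.now(timezone.utc) - timedelta(hours=hours)
    return hook.is_updated_after(bucket, object_name, cutoff)
```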

delete(self, bucket, object, generation=None)

Deletes an object if versioning is not enabled for the bucket, or if the generation parameter is used.

Parameters
  • bucket (str) – name of the bucket, where the object resides

  • object (str) – name of the object to delete

  • generation (str) – if present, permanently delete the object of this generation

Returns

True if succeeded
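
The return value can be used to track which deletions did not succeed. This sketch assumes hook is a GoogleCloudStorageHook instance and that a failed deletion yields a falsy return value rather than an exception, which the text above does not state explicitly; delete_objects is a hypothetical helper name.

```python
def delete_objects(hook, bucket, object_names):
    """Delete each named object and return the names that were not deleted."""
    failed = []
    for name in object_names:
        # delete is documented to return True on success.
        if not hook.delete(bucket, name):
            failed.append(name)
    return failed
```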

list(self, bucket, versions=None, maxResults=None, prefix=None, delimiter=None)

Lists all objects in the bucket whose names begin with the given string prefix.

Parameters
  • bucket (str) – bucket name

  • versions (bool) – if true, list all versions of the objects

  • maxResults (int) – max count of items to return in a single page of responses

  • prefix (str) – prefix string; only objects whose names begin with this prefix are returned

  • delimiter (str) – filters objects based on the delimiter (e.g. ‘.csv’)

Returns

a stream of object names matching the filtering criteria
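
A short sketch of listing by prefix and then narrowing the result on the client side. It assumes hook is a GoogleCloudStorageHook instance and that the returned names can be iterated once; list_with_suffix is a hypothetical helper and does not rely on the delimiter argument.

```python
def list_with_suffix(hook, bucket, prefix, suffix):
    """Return object names under prefix that end with suffix."""
    names = hook.list(bucket, prefix=prefix)
    # Filter locally so the behaviour does not depend on server-side delimiter handling.
    return [name for name in names if name.endswith(suffix)]
```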

get_size(self, bucket, object)

Gets the size of a file in Google Cloud Storage.

Parameters
  • bucket (str) – The Google cloud storage bucket where the object is.

  • object (str) – The name of the object to check in the Google cloud storage bucket.

get_crc32c(self, bucket, object)

Gets the CRC32c checksum of an object in Google Cloud Storage.

Parameters
  • bucket (str) – The Google cloud storage bucket where the object is.

  • object (str) – The name of the object to check in the Google cloud storage bucket.

get_md5hash(self, bucket, object)

Gets the MD5 hash of an object in Google Cloud Storage.

Parameters
  • bucket (str) – The Google cloud storage bucket where the object is.

  • object (str) – The name of the object to check in the Google cloud storage bucket.
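
One use of the MD5 hash is verifying a local copy against the stored object. The sketch below assumes the hook returns the hash in the base64-encoded form that Cloud Storage reports in object metadata, and that hook is a GoogleCloudStorageHook instance; both helper names are hypothetical.

```python
import base64
import hashlib


def local_md5_base64(path, block_size=1024 * 1024):
    """Compute the base64-encoded MD5 digest of a local file, reading in blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return base64.b64encode(digest.digest()).decode("ascii")


def matches_remote(hook, bucket, object_name, path):
    """Compare the hook's MD5 hash for an object with a local file's hash."""
    # Assumption: get_md5hash returns the base64-encoded digest as a str.
    return hook.get_md5hash(bucket, object_name) == local_md5_base64(path)
```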

create_bucket(self, bucket_name, resource=None, storage_class='MULTI_REGIONAL', location='US', project_id=None, labels=None)

Creates a new bucket. Google Cloud Storage uses a flat namespace, so you can’t create a bucket with a name that is already in use.

See also

For more information, see Bucket Naming Guidelines: https://cloud.google.com/storage/docs/bucketnaming.html#requirements

Parameters
  • bucket_name (str) – The name of the bucket.

  • resource (dict) – An optional dict with parameters for creating the bucket. For information on available parameters, see Cloud Storage API doc: https://cloud.google.com/storage/docs/json_api/v1/buckets/insert

  • storage_class (str) –

    This defines how objects in the bucket are stored and determines the SLA and the cost of storage. Values include

    • MULTI_REGIONAL

    • REGIONAL

    • STANDARD

    • NEARLINE

    • COLDLINE.

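    If no storage class is specified when the bucket is created, Cloud Storage defaults to STANDARD. Note that this method's own default argument, as shown in the signature above, is MULTI_REGIONAL.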

  • location (str) –

    The location of the bucket. Object data for objects in the bucket resides in physical storage within this region. Defaults to US.

  • project_id (str) – The ID of the GCP Project.

  • labels (dict) – User-provided labels, in key/value pairs.

Returns

If successful, it returns the id of the bucket.
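
As a sketch of overriding the defaults, the helper below creates a REGIONAL bucket in a caller-supplied location. It assumes hook is a GoogleCloudStorageHook instance; create_regional_bucket is a hypothetical name, and the caller is responsible for choosing a location that is valid for the REGIONAL class.

```python
def create_regional_bucket(hook, bucket_name, project_id, region, labels=None):
    """Create a bucket with the REGIONAL storage class in the given region.

    Returns whatever create_bucket returns (documented as the bucket id).
    """
    return hook.create_bucket(
        bucket_name=bucket_name,
        storage_class="REGIONAL",
        location=region,
        project_id=project_id,
        labels=labels,
    )
```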

insert_bucket_acl(self, bucket, entity, role, user_project)

Creates a new ACL entry on the specified bucket. See: https://cloud.google.com/storage/docs/json_api/v1/bucketAccessControls/insert

Parameters
  • bucket (str) – Name of a bucket.

  • entity (str) – The entity holding the permission, in one of the following forms: user-userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId, allUsers, allAuthenticatedUsers. See: https://cloud.google.com/storage/docs/access-control/lists#scopes

  • role (str) – The access permission for the entity. Acceptable values are: “OWNER”, “READER”, “WRITER”.

  • user_project (str) – (Optional) The project to be billed for this request. Required for Requester Pays buckets.
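
The entity string follows a fixed pattern, so it can be built and the role validated before calling the hook. This is a minimal sketch assuming hook is a GoogleCloudStorageHook instance; BUCKET_ROLES and grant_bucket_access are hypothetical names. Because the signature lists user_project without a default, the sketch always passes it explicitly, using None when no billing project applies.

```python
BUCKET_ROLES = {"OWNER", "READER", "WRITER"}  # acceptable values listed above


def grant_bucket_access(hook, bucket, email, role, user_project=None):
    """Grant a user, identified by e-mail, a role on a bucket."""
    if role not in BUCKET_ROLES:
        raise ValueError("role must be one of %s" % sorted(BUCKET_ROLES))
    entity = "user-" + email  # the user-email entity form
    hook.insert_bucket_acl(
        bucket=bucket, entity=entity, role=role, user_project=user_project
    )
    return entity
```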

insert_object_acl(self, bucket, object_name, entity, role, generation, user_project)

Creates a new ACL entry on the specified object. See: https://cloud.google.com/storage/docs/json_api/v1/objectAccessControls/insert

Parameters
  • bucket (str) – Name of a bucket.

  • object_name (str) – Name of the object. For information about how to URL encode object names to be path safe, see: https://cloud.google.com/storage/docs/json_api/#encoding

  • entity (str) – The entity holding the permission, in one of the following forms: user-userId, user-email, group-groupId, group-email, domain-domain, project-team-projectId, allUsers, allAuthenticatedUsers. See: https://cloud.google.com/storage/docs/access-control/lists#scopes

  • role (str) – The access permission for the entity. Acceptable values are: “OWNER”, “READER”.

  • generation (str) – (Optional) If present, selects a specific revision of this object (as opposed to the latest version, the default).

  • user_project (str) – (Optional) The project to be billed for this request. Required for Requester Pays buckets.
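
A common case is granting read access on a single object to everyone, using the allUsers entity and the READER role. The sketch assumes hook is a GoogleCloudStorageHook instance; make_object_public is a hypothetical name. The signature lists generation and user_project without defaults, so both are passed explicitly as None here.

```python
def make_object_public(hook, bucket, object_name):
    """Add a READER ACL entry for allUsers on the latest version of an object."""
    hook.insert_object_acl(
        bucket=bucket,
        object_name=object_name,
        entity="allUsers",
        role="READER",
        generation=None,    # latest version
        user_project=None,  # no Requester Pays billing project
    )
```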

compose(self, bucket, source_objects, destination_object, num_retries=5)

Composes a list of existing objects into a new object in the same storage bucket.

Currently, at most 32 objects can be concatenated in a single operation.

See: https://cloud.google.com/storage/docs/json_api/v1/objects/compose

Parameters
  • bucket (str) – The name of the bucket containing the source objects. This is also the same bucket to store the composed destination object.

  • source_objects (list) – The list of source objects that will be composed into a single object.

  • destination_object (str) – The path of the composed destination object.
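
When more than 32 source objects need to be concatenated, one workaround is to compose in several passes, feeding the destination back in as the first source of each later pass. The sketch below illustrates that idea under the assumption that hook is a GoogleCloudStorageHook instance and that the destination may also appear among the sources; compose_many and MAX_COMPOSE_SOURCES are hypothetical names.

```python
MAX_COMPOSE_SOURCES = 32  # per-operation limit stated above


def compose_many(hook, bucket, source_objects, destination_object):
    """Concatenate any number of objects, in order, using repeated compose calls."""
    sources = list(source_objects)
    if not sources:
        raise ValueError("at least one source object is required")
    # First pass: up to 32 sources go straight into the destination.
    hook.compose(bucket, sources[:MAX_COMPOSE_SOURCES], destination_object)
    remaining = sources[MAX_COMPOSE_SOURCES:]
    # Later passes: the destination takes one slot, leaving 31 for new sources.
    step = MAX_COMPOSE_SOURCES - 1
    for start in range(0, len(remaining), step):
        batch = [destination_object] + remaining[start:start + step]
        hook.compose(bucket, batch, destination_object)
```

Each pass rewrites the growing destination object, so this approach trades extra requests for staying within the per-operation limit.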

airflow.contrib.hooks.gcs_hook._parse_gcs_url(gsurl)

Given a Google Cloud Storage URL (gs://<bucket>/<blob>), returns a tuple containing the corresponding bucket and blob.
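
To illustrate the kind of parsing described, here is an independent sketch built on the standard library. It is not the module's own implementation; the function name parse_gs_url and the choice to raise ValueError on malformed input are assumptions made for this example.

```python
from urllib.parse import urlparse


def parse_gs_url(gsurl):
    """Split gs://<bucket>/<blob> into a (bucket, blob) tuple."""
    parsed = urlparse(gsurl)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError("expected a URL of the form gs://<bucket>/<blob>")
    # The path starts with "/", which is not part of the blob name.
    return parsed.netloc, parsed.path.lstrip("/")
```

Note that urlparse treats ? and # as query and fragment separators, so blob names containing those characters would need different handling.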