yogadl.storage

class yogadl.storage.LFSConfigurations(storage_dir_path: str)

Configurations for LFSStorage.

class yogadl.storage.LFSStorage(configurations: yogadl.storage._lfs_storage.LFSConfigurations, tensorflow_config: Optional[tensorflow.core.protobuf.config_pb2.ConfigProto] = None)

Storage for local file system (not NFS).
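
For example, a minimal sketch of constructing an LFSStorage (the cache directory shown is only an illustrative value):

    import yogadl.storage

    # Keep the dataset cache under a local (non-NFS) directory; the path is illustrative.
    lfs_config = yogadl.storage.LFSConfigurations(storage_dir_path="/tmp/yogadl_cache")
    lfs_storage = yogadl.storage.LFSStorage(lfs_config)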

cacheable(dataset_id: str, dataset_version: str) → Callable

A decorator that calls submit and fetch and is responsible for coordinating amongst instantiations of Storage in different processes.
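
A minimal sketch of the decorator in use, continuing from the lfs_storage object constructed above. The dataset contents and identifiers are illustrative, and the stream() arguments and the yogadl.tensorflow.make_tf_dataset helper are assumptions based on the yogadl DataRef/Stream interface:

    import tensorflow as tf
    import yogadl.tensorflow

    @lfs_storage.cacheable(dataset_id="range-example", dataset_version="1")
    def make_data():
        # Only executed when no cache exists yet for this id/version.
        return tf.data.Dataset.range(10)

    dataref = make_data()                                   # an LMDBDataRef
    stream = dataref.stream(shuffle=True, shuffle_seed=777)
    dataset = yogadl.tensorflow.make_tf_dataset(stream)     # back to a tf.data.Dataset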

fetch(dataset_id: str, dataset_version: str) → yogadl.dataref._local_lmdb_dataref.LMDBDataRef

Fetch a dataset from storage and provide a DataRef for streaming it.

submit(data: tensorflow.python.data.ops.dataset_ops.DatasetV2, dataset_id: str, dataset_version: str) → None

Stores the dataset to a cache and updates the metadata file accordingly.

If a cache with a matching filepath already exists, it will be overwritten.
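
For single-process use, the lower-level calls can also be issued directly. A minimal sketch, reusing the lfs_storage object from above (identifiers and dataset contents are illustrative):

    import tensorflow as tf

    data = tf.data.Dataset.range(10)
    lfs_storage.submit(data=data, dataset_id="range-example", dataset_version="1")

    dataref = lfs_storage.fetch(dataset_id="range-example", dataset_version="1")
    stream = dataref.stream()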

class yogadl.storage.S3Configurations(bucket: str, bucket_directory_path: str, url: str, access_key: Optional[str] = None, secret_key: Optional[str] = None, endpoint_url: Optional[str] = None, local_cache_dir: str = '/tmp/', skip_verify: bool = False, coordinator_cert_file: Optional[str] = None, coordinator_cert_name: Optional[str] = None)
class yogadl.storage.S3Storage(configurations: yogadl.storage._s3_storage.S3Configurations, tensorflow_config: Optional[tensorflow.core.protobuf.config_pb2.ConfigProto] = None)

Stores dataset cache in AWS S3.

S3Storage creates a local cache from a dataset and then uploads it to the specified S3 bucket. When fetching from S3, the creation time of the local cache (recorded in its metadata) is compared to the creation time of the S3 cache; if they are not equivalent, the local cache is overwritten.

The S3 cache and the local cache are potentially shared across a number of concurrent processes. cacheable() provides synchronization guarantees. Users should not call submit() or fetch() directly if they anticipate concurrent data accesses.
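
A minimal sketch of configuring S3Storage. The bucket, directory, and URL values are illustrative; the url argument is presumed to point at the coordinator service that cacheable() uses for locking, and credentials left unset are assumed to be resolved from the environment:

    import yogadl.storage

    s3_config = yogadl.storage.S3Configurations(
        bucket="my-dataset-cache",          # illustrative bucket name
        bucket_directory_path="yogadl",
        url="ws://localhost:1337",          # presumed coordinator URL; illustrative
        local_cache_dir="/tmp/",
    )
    s3_storage = yogadl.storage.S3Storage(s3_config)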

cacheable(dataset_id: str, dataset_version: str) → Callable

A decorator that calls submit and fetch and is responsible for coordinating amongst instantiations of Storage in different processes.

Initially requests a read lock; if the cache is not present in cloud storage, requests a write lock and submits to cloud storage. Once the file is present in cloud storage, requests a read lock and fetches.

fetch(dataset_id: str, dataset_version: str) → yogadl.dataref._local_lmdb_dataref.LMDBDataRef

Fetch a dataset from cloud storage and provide a DataRef for streaming it.

The timestamp of the cache in cloud storage is compared to the creation time of the local cache; if they are not identical, the local cache is overwritten.

fetch() is not safe for concurrent access; for concurrent access, use cacheable().

submit(data: tensorflow.python.data.ops.dataset_ops.DatasetV2, dataset_id: str, dataset_version: str) → None

Stores the dataset by creating a local cache and uploading it to cloud storage.

If a cache with a matching filepath already exists in cloud storage, it will be overwritten.

submit() is not safe for concurrent access; for concurrent access, use cacheable().

class yogadl.storage.GCSConfigurations(bucket: str, bucket_directory_path: str, url: str, local_cache_dir: str = '/tmp/', skip_verify: bool = False, coordinator_cert_file: Optional[str] = None, coordinator_cert_name: Optional[str] = None)
class yogadl.storage.GCSStorage(configurations: yogadl.storage._gcs_storage.GCSConfigurations, tensorflow_config: Optional[tensorflow.core.protobuf.config_pb2.ConfigProto] = None)

Stores dataset cache in Google Cloud Storage (GCS).

GCSStorage creates a local cache from a dataset and then uploads it to the specified GCS bucket. When fetching from GCS, the creation time of the local cache (recorded in its metadata) is compared to the creation time of the GCS cache; if they are not equivalent, the local cache is overwritten.

The GCS cache and the local cache are potentially shared across a number of concurrent processes. cacheable() provides synchronization guarantees. Users should not call submit() or fetch() directly if they anticipate concurrent data accesses.

Authentication is currently only supported via the “Application Default Credentials” method in GCP. Typical configuration: ensure your VM runs as a service account that has sufficient permissions to read/write/delete from the GCS bucket where the dataset cache will be stored (this only works when running in GCE).
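
A minimal sketch of configuring GCSStorage under Application Default Credentials. The values shown are illustrative; as with S3Storage, the url argument is presumed to identify the coordinator service used by cacheable():

    import yogadl.storage

    gcs_config = yogadl.storage.GCSConfigurations(
        bucket="my-dataset-cache",          # illustrative bucket name
        bucket_directory_path="yogadl",
        url="ws://localhost:1337",          # presumed coordinator URL; illustrative
        local_cache_dir="/tmp/",
    )
    gcs_storage = yogadl.storage.GCSStorage(gcs_config)

The resulting object is used the same way as the other Storage classes, for example via @gcs_storage.cacheable("mnist", "1").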

cacheable(dataset_id: str, dataset_version: str) → Callable

A decorator that calls submit and fetch and is responsible for coordinating amongst instantiations of Storage in different processes.

Initially requests a read lock; if the cache is not present in cloud storage, requests a write lock and submits to cloud storage. Once the file is present in cloud storage, requests a read lock and fetches.

fetch(dataset_id: str, dataset_version: str) → yogadl.dataref._local_lmdb_dataref.LMDBDataRef

Fetch a dataset from cloud storage and provide a DataRef for streaming it.

The timestamp of the cache in cloud storage is compared to the creation time of the local cache; if they are not identical, the local cache is overwritten.

fetch() is not safe for concurrent access; for concurrent access, use cacheable().

submit(data: tensorflow.python.data.ops.dataset_ops.DatasetV2, dataset_id: str, dataset_version: str) → None

Stores the dataset by creating a local cache and uploading it to cloud storage.

If a cache with a matching filepath already exists in cloud storage, it will be overwritten.

submit() is not safe for concurrent access; for concurrent access, use cacheable().
