cloud_files_state table-valued function

Applies to: Databricks SQL, Databricks Runtime 11.3 LTS and above

Returns the file-level state of an Auto Loader or read_files stream.

Syntax

cloud_files_state( { TABLE ( table_name ) | checkpoint } )

Arguments

  • table_name: The identifier of the streaming table that’s being written to by read_files. The name must not include a temporal specification. Available in Databricks Runtime 13.3 LTS and above.
  • checkpoint: A STRING literal. The checkpoint directory for a stream using the Auto Loader source. See What is Auto Loader?.
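For example, either argument form can be passed to the function; the checkpoint path and table name below are placeholders:

-- Count the files tracked in a checkpoint (placeholder path)
> SELECT COUNT(*) FROM CLOUD_FILES_STATE('/path/to/checkpoint');

-- The same query against a streaming table (placeholder name)
> SELECT COUNT(*) FROM CLOUD_FILES_STATE(TABLE(my_streaming_table));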

Returns

Returns a table with the following schema:

  • path STRING NOT NULL PRIMARY KEY

    The path of a file.

  • size BIGINT NOT NULL

    The size of a file in bytes.

  • create_time TIMESTAMP NOT NULL

    The time that a file was created.

  • discovery_time TIMESTAMP NOT NULL

    Applies to: Databricks Runtime 16.4 and above

    The time that a file was discovered.

  • processed_time TIMESTAMP NOT NULL

    Applies to: Databricks Runtime 16.4 and above when cloudFiles.cleanSource is enabled. See Auto Loader options.

    The time that a file was processed. If a batch encounters a failure and is retried, a file might be processed multiple times. When retries happen, this field contains the most recent processing time.

  • commit_time TIMESTAMP

    Applies to: Databricks Runtime 16.4 and above when cloudFiles.cleanSource is enabled. See Auto Loader options.

    The time that a file was committed to the checkpoint after processing. NULL if the file has not yet been processed. There is no guaranteed latency for marking a file as committed; a file might be processed but marked as committed arbitrarily later. Once a file is marked as committed, Auto Loader no longer requires it for processing.

  • archive_time TIMESTAMP

    Applies to: Databricks Runtime 16.4 and above when cloudFiles.cleanSource is enabled. See Auto Loader options.

    The time that a file was archived. NULL if the file has not been archived.

  • archive_mode STRING

    Applies to: Databricks Runtime 16.4 and above when cloudFiles.cleanSource is enabled. See Auto Loader options.

    MOVE if cloudFiles.cleanSource was set to MOVE when the file was archived.

    DELETE if cloudFiles.cleanSource was set to DELETE when the file was archived.

    NULL if cloudFiles.cleanSource was set to OFF (default).

  • move_location STRING

    Applies to: Databricks Runtime 16.4 and above when cloudFiles.cleanSource is enabled. See Auto Loader options.

    The full path to which the file was moved during the archival operation when cloudFiles.cleanSource was set to MOVE.

    NULL if the file has not been archived or if cloudFiles.cleanSource is set to DELETE or OFF.

  • source_id STRING

    The ID of the Auto Loader source in the streaming query. This value is '0' for streams that ingest from a single cloud object store location.

  • flow_name STRING

    Applies to: Databricks SQL, Databricks Runtime 13.3 LTS and above

    Represents a specific streaming flow in DLT (Delta Live Tables) that contains one or more cloud files sources. NULL if no table_name was given.

  • ingestion_state STRING

    Applies to: Databricks Runtime 16.4 and above when cloudFiles.cleanSource is enabled. See Auto Loader options.

    Whether the file has been ingested, indicated by one of the following states:

    • NULL: The file has not been processed yet, or the file state cannot be determined by Auto Loader.
    • PROCESSING: The file is being processed.
    • SKIPPED_CORRUPTED: The file was not ingested because it was corrupt.
    • SKIPPED_MISSING: The file was not ingested because it was not found during processing.
    • INGESTED: The file has been processed by the sink at least once. It may be processed again by non-idempotent sinks like foreachBatch in case of failures in the stream. Only files with a non-null commit_time field that are in the INGESTED state have completed processing.
    • NOT_RECOGNIZED_BY_DBR: Reserved for version compatibility. This value is shown when a file's state was written by a later Databricks Runtime version and is not recognized by the earlier version reading it.
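As a sketch of how these columns combine, the following query (assuming Databricks Runtime 16.4 or above with cloudFiles.cleanSource enabled, and a placeholder streaming table name) lists files that have been discovered but not yet committed to the checkpoint:

> SELECT path, discovery_time, ingestion_state
  FROM CLOUD_FILES_STATE(TABLE(my_streaming_table))
  WHERE commit_time IS NULL;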

Permissions

You need to have:

  • OWNER privileges on the streaming table if using a streaming table identifier.
  • READ FILES privileges on the checkpoint location if providing a checkpoint under an external location.

Examples

-- Simple example from checkpoint
> SELECT path FROM CLOUD_FILES_STATE('/some/checkpoint');
  /some/input/path
  /other/input/path

-- Simple example from source subdir
> SELECT path FROM CLOUD_FILES_STATE('/some/checkpoint/sources/0');
  /some/input/path
  /other/input/path

-- Simple example from streaming table
> SELECT path FROM CLOUD_FILES_STATE(TABLE(my_streaming_table));
  /some/input/path
  /other/input/path
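
-- Aggregating archival state; assumes cloudFiles.cleanSource is enabled
-- (Databricks Runtime 16.4 and above) and uses a placeholder checkpoint path
> SELECT archive_mode, COUNT(*) AS file_count
  FROM CLOUD_FILES_STATE('/some/checkpoint')
  GROUP BY archive_mode;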