Use this form to visualize the JSON Schema for a Comet Data Pipeline.
types - List of semantic type definitions. Each type defines: name, primitiveType, pattern, zone (useful for timestamps / dates), sample, comment, indexMapping.
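As a quick illustration, a sketch of a types list as it might appear in a types file; the names, patterns, sample values and the timezone used for zone are purely illustrative:

types:
  - name: "string"
    primitiveType: "string"
    pattern: ".*"
    comment: "Any character sequence"
  - name: "date_fr"
    primitiveType: "date"
    pattern: "dd/MM/yyyy"
    zone: "Europe/Paris"    # zone is useful for timestamps / dates
    sample: "21/07/2021"
    comment: "French-formatted date"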
.env

.transform
transform.name - Job name. Must match the prefix of the filename: [JOBNAME].comet.yml.
transform.area - Area where the data is located. With the BigQuery engine, the area is the name of the dataset this job works on. With the Spark engine, it is the folder where the data is stored. Default value is "business".
transform.format - Output file format when using the Spark engine. Ignored for BigQuery. Default value is "parquet".
transform.coalesce - When writing output files, whether to coalesce them into a single file. Useful when CSV is the output format.
transform.udf - With the Spark engine, register UDFs written in this JVM class. With the BigQuery engine, register UDFs stored at this location.
transform.tasks - List of transformation tasks. Each task defines: sql, engine, domain, dataset, write, area and an optional sink (sink.type, sink.name, sink.id, sink.timestamp). Task fields are detailed below.
.transform.views

transform.engine - SPARK or BQ. Default value is SPARK.
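For illustration, a sketch of a transform job file named sales_summary.comet.yml, assuming the transform root shown by this schema; the task SQL and all names are illustrative:

transform:
  name: "sales_summary"    # must match the filename prefix
  area: "business"         # dataset (BigQuery engine) or folder (Spark engine)
  format: "parquet"        # Spark engine only, ignored by BigQuery
  engine: "BQ"             # SPARK or BQ
  tasks:
    - sql: "SELECT seller_id, SUM(amount) AS total FROM sales.orders GROUP BY seller_id"
      domain: "sales"
      dataset: "seller_totals"
      write: "OVERWRITE"   # append to or overwrite existing data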
.load
load.name - Domain name. Use a name that is valid as a folder name on the target storage. With HDFS or Cloud Storage, ingested files are stored in a sub-directory named after the domain. With BigQuery, files are ingested into tables under a dataset named after the domain.
load.directory - Folder on the local filesystem where incoming files land. This folder is typically scanned periodically to move datasets to the cluster for ingestion. Files located here are moved to the pending folder by the "import" command.
.load.metadata

.load.metadata.partition
load.metadata.partition.sampling - 0.0 means no sampling; a value > 0 and < 1 samples the dataset; a value >= 1 is an absolute number of partitions.
load.metadata.mode - FILE by default. FILE and STREAM are the two accepted values; FILE is currently the only supported mode.
load.metadata.format - DSV by default. Supported file formats are:
- DSV: delimiter-separated values; the delimiter is set in the "separator" field.
- POSITION: fixed-width file where values sit at an exact position in each line.
- SIMPLE_JSON: JSON with top-level fields only, kept separate from deep JSON for performance.
- JSON: deep JSON; use only when your documents contain subdocuments, otherwise prefer SIMPLE_JSON, which is much faster.
- XML: XML files.
load.metadata.encoding - UTF-8 if not specified.
load.metadata.multiline - Are JSON objects on a single line or spread over multiple lines? Single-line (false) by default; single-line is also faster.
load.metadata.array - Is the JSON stored as a single array of objects? false by default, meaning one JSON document per line.
load.metadata.withHeader - Does the dataset have a header? true by default.
load.metadata.separator - The value delimiter, ';' by default. May be a multi-character string starting from Spark 3.
load.metadata.quote - The string quote character, '"' by default.
load.metadata.escape - The escape character, '\' by default.
load.metadata.write - Append to or overwrite existing data.
load.metadata.partition.attributes - List of attributes used for partitioning.
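A sketch of a metadata block for a delimiter-separated file with partitioning; the separator and partition columns are illustrative:

metadata:
  mode: "FILE"
  format: "DSV"
  encoding: "UTF-8"
  withHeader: true
  separator: "|"
  write: "APPEND"      # append to or overwrite existing data
  partition:
    sampling: 0.0      # no sampling
    attributes:
      - "year"
      - "month"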
.load.metadata.sink
load.metadata.sink.type - Where to sink the data.
load.metadata.sink.name - Once ingested, files may be written to BigQuery, Elasticsearch or any JDBC-compliant database.
load.metadata.sink.id - ES: attribute to use as the document id. Generated by Elasticsearch if not specified.
load.metadata.sink.timestamp - BQ: the timestamp column to use for table partitioning, if any; no partitioning by default. ES: timestamp field format as expected by Elasticsearch (for example "{beginTs|yyyy.MM.dd}").
load.metadata.sink.location - BQ: database location (EU, US, ...).
load.metadata.sink.clustering - BQ: ordered list of columns to use for table clustering.
load.metadata.sink.days - BQ: number of days before this table expires and is deleted. Never by default.
load.metadata.sink.requirePartitionFilter - BQ: should a partition filter be required on every query? No by default.
load.metadata.sink.connection - JDBC: connection string.
load.metadata.sink.partitions - JDBC: number of Spark partitions.
load.metadata.sink.batchSize - JDBC: batch size of each JDBC bulk insert.
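A sketch of a BigQuery sink under metadata; the partition column, clustering columns and expiry are illustrative:

metadata:
  sink:
    type: "BQ"
    timestamp: "order_date"       # partition column
    location: "EU"
    clustering:
      - "country"
      - "city"
    days: 90                      # expire and delete after 90 days
    requirePartitionFilter: true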
load.metadata.ignore - Pattern to ignore, or UDF to apply in order to skip some lines.
load.metadata.clustering - List of attributes to use for clustering.
.load.metadata.xml

load.comment - Domain description (free text).
load.ack - Ack extension used for each file; ".ack" if not specified. Files are moved to the pending folder only once a file with the same name as the source file and with this extension is present. To move a file without requiring an ack file, explicitly set this property to the empty string "".
load.schemaRefs - List of files containing the schemas. Each should start with an '_' and be located in the same folder.
load.schemas - List of schemas for each dataset in this domain. A domain usually contains multiple schemas, each defining how the contents of the input file should be parsed. Schema fields are detailed below.
load.extensions - Recognized filename extensions. json, csv, dsv and psv are recognized by default. Only files with these extensions are moved to the pending folder.
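Putting these fields together, a sketch of a domain file, assuming the load root shown by this schema; the domain, folder, schema names and pattern are illustrative:

load:
  name: "sales"
  directory: "/incoming/sales"
  ack: ""                         # do not wait for an ack file
  extensions:
    - "csv"
  schemas:
    - name: "orders"
      pattern: "orders-.*\\.csv"
      metadata:
        format: "DSV"
        withHeader: true
        separator: ";"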
.assertions

.views
Task fields:
sql - Main SQL request to execute (do not forget to prefix table names with the database name to avoid conflicts).
engine - SPARK or BQ. Default value is SPARK.
domain - Output domain in the output area (the database name in Hive, the dataset in BigQuery).
dataset - Dataset name in the output area (the table name in Hive and BigQuery).
write - Append to or overwrite existing data.
area - Target area where the domain / dataset will be stored.
partition - List of columns used for partitioning the output.
presql - List of SQL requests to execute before the main SQL request is run.
postsql - List of SQL requests to execute after the main SQL request is run.
.sink
sink.type - Where to sink the data.
sink.name - Once ingested, data may be written to BigQuery, Elasticsearch or any JDBC-compliant database.
sink.id - ES: attribute to use as the document id. Generated by Elasticsearch if not specified.
sink.timestamp - BQ: the timestamp column to use for table partitioning, if any; no partitioning by default. ES: timestamp field format as expected by Elasticsearch (for example "{beginTs|yyyy.MM.dd}").
sink.location - BQ: database location (EU, US, ...).
sink.clustering - BQ: ordered list of columns to use for table clustering.
sink.days - BQ: number of days before this table expires and is deleted. Never by default.
sink.requirePartitionFilter - BQ: should a partition filter be required on every query? No by default.
sink.connection - JDBC: connection string.
sink.partitions - JDBC: number of Spark partitions.
sink.batchSize - JDBC: batch size of each JDBC bulk insert.
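A sketch of a single task with presql and an Elasticsearch sink; the table names, id column and timestamp format are illustrative:

tasks:
  - sql: "SELECT * FROM sales.orders WHERE amount > 0"
    domain: "sales"
    dataset: "valid_orders"
    write: "APPEND"
    area: "business"
    presql:
      - "DELETE FROM sales.valid_orders WHERE load_date = CURRENT_DATE()"
    sink:
      type: "ES"
      id: "order_id"                    # document id
      timestamp: "{beginTs|yyyy.MM.dd}" # index timestamp format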
rls - Row level security rules attached to this task. Each rule defines: name, predicate (see the RLS fields below).
.assertions

Row level security fields:
name - Unique name of this row level security rule.
predicate - The condition that goes into the WHERE clause and limits the visible rows.
grants - Users / groups / service accounts to which this security level applies, e.g. user:me@mycompany.com,group:group@mycompany.com,serviceAccount:mysa@google-accounts.com
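A sketch of an RLS rule; the rule name, predicate and grants are illustrative:

rls:
  - name: "sales_fr_only"
    predicate: "country = 'FR'"   # goes into the WHERE clause
    grants:
      - "user:me@mycompany.com"
      - "group:sales@mycompany.com"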
Schema fields:
name - Schema name; must be unique among all the schemas belonging to the same domain. Becomes the Hive table name on premise or the BigQuery table name on GCP.
pattern - Filename pattern to which this schema applies. The framework uses this schema to parse any file whose name matches this pattern.
primaryKey - List of columns that make up the primary key.
attributes - Attribute parsing rules. Each attribute defines: name, type, foreignKey, array, required, privacy, comment, rename, metricType, position. Attribute fields are detailed below.
.metadata - Schema-level metadata block. Same fields and defaults as load.metadata above (partition, mode, format, encoding, multiline, array, withHeader, separator, quote, escape, write, sink, ignore, clustering, xml).
.merge
merge.delete - Optional valid SQL condition on the incoming dataset. Use renamed columns here.
merge.timestamp - Timestamp column used to identify the last version; if not specified, the currently ingested row is considered the last.
merge.queryFilter
comment - Free text.
merge.key - List of attributes used to join the existing dataset with the incoming one. Use renamed columns here.
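A sketch of a merge block that keeps the latest version of each row; the column names and delete condition are illustrative:

merge:
  key:
    - "order_id"                    # join key between existing and incoming rows
  timestamp: "updated_at"           # the latest value wins
  delete: "status = 'CANCELLED'"    # SQL condition evaluated on the incoming dataset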
presql - Reserved for future use.
postsql - Reserved for future use.
tags - Set of strings to attach to this schema.
rls - Experimental. Row level security rules applied to this schema. Each rule defines: name, predicate (see the RLS fields above).
.assertions
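Combining the schema fields above, a sketch of one schema entry; all names and the privacy strategy are illustrative:

schemas:
  - name: "customers"
    pattern: "customers-.*\\.psv"
    primaryKey:
      - "customer_id"
    metadata:
      format: "DSV"
      separator: "|"
      withHeader: true
    attributes:
      - name: "customer_id"
        type: "string"
        required: true
      - name: "email"
        type: "string"
        privacy: "SHA256"    # privacy transformation applied at ingestion time
    tags:
      - "pii"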
Attribute fields:
name - Attribute name as defined in the source dataset and as received in the file.
type - Semantic type of the attribute.
foreignKey - If this attribute is a foreign key, a reference of the form [domain.]table[.attribute].
array - Is this attribute an array?
required - Should this attribute always be present in the source?
privacy - Privacy transformation to apply to this attribute at ingestion time.
comment - Free text describing the attribute.
rename - If present, the attribute is renamed with this name.
metricType - If present, the kind of statistic to compute for this field.
attributes - List of sub-attributes (valid for JSON and XML files only). Sub-attributes carry the same fields as attributes.
.position
position.first - First character position of the value in the line (POSITION format).
position.last - Last character position of the value in the line (POSITION format).
default - Default value for this attribute when it is not present.
trim
script - Scripted field: SQL request on renamed columns.
tags - Tags associated with this attribute.
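A sketch of attributes for a fixed-width (POSITION) file, showing position, rename, default and a scripted field; the names, offsets and expression are illustrative:

attributes:
  - name: "cust_no"
    type: "string"
    required: true
    rename: "customer_id"     # exposed under the new name
    position:
      first: 0
      last: 9
  - name: "amount"
    type: "double"
    position:
      first: 10
      last: 19
    default: "0"              # used when the value is absent
  - name: "amount_eur"
    type: "double"
    script: "amount * 1.1"    # SQL request on renamed columns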