Use this form to visualize the JSON Schema for a Comet Data Pipeline.
types - List of semantic type definitions. Each type defines: name, primitiveType, pattern, zone (useful for timestamps / dates), sample, comment, indexMapping.
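As a quick illustration, a sketch of a types list as it might appear in a types file; the names, patterns, sample values and the timezone used for zone are purely illustrative:

types:
  - name: "string"
    primitiveType: "string"
    pattern: ".*"
    comment: "Any character sequence"
  - name: "date_fr"
    primitiveType: "date"
    pattern: "dd/MM/yyyy"
    zone: "Europe/Paris"    # zone is useful for timestamps / dates
    sample: "21/07/2021"
    comment: "French-formatted date"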
.env

.transform
transform.name - Job name. Must match the prefix of the filename: [JOBNAME].comet.yml.
transform.area - Area where the data is located. With the BigQuery engine, the area is the name of the dataset this job works on. With the Spark engine, it is the folder where the data is stored. Default value is "business".
transform.format - Output file format when using the Spark engine. Ignored for BigQuery. Default value is "parquet".
transform.coalesce - When writing output files, whether to coalesce them into a single file. Useful when CSV is the output format.
transform.udf - With the Spark engine, register UDFs written in this JVM class. With the BigQuery engine, register UDFs stored at this location.
transform.tasks - List of transformation tasks. Each task defines: sql, engine, domain, dataset, write, area and an optional sink (sink.type, sink.name, sink.id, sink.timestamp). Task fields are detailed below.
.transform.views

transform.engine - SPARK or BQ. Default value is SPARK.
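For illustration, a sketch of a transform job file named sales_summary.comet.yml, assuming the transform root shown by this schema; the task SQL and all names are illustrative:

transform:
  name: "sales_summary"    # must match the filename prefix
  area: "business"         # dataset (BigQuery engine) or folder (Spark engine)
  format: "parquet"        # Spark engine only, ignored by BigQuery
  engine: "BQ"             # SPARK or BQ
  tasks:
    - sql: "SELECT seller_id, SUM(amount) AS total FROM sales.orders GROUP BY seller_id"
      domain: "sales"
      dataset: "seller_totals"
      write: "OVERWRITE"   # append to or overwrite existing data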
.load
load.name - Domain name. Use a name that is valid as a folder name on the target storage. With HDFS or Cloud Storage, ingested files are stored in a sub-directory named after the domain. With BigQuery, files are ingested into tables under a dataset named after the domain.
load.directory - Folder on the local filesystem where incoming files land. This folder is typically scanned periodically to move datasets to the cluster for ingestion. Files located here are moved to the pending folder by the "import" command.
.load.metadata

.load.metadata.partition
load.metadata.partition.sampling - 0.0 means no sampling; a value > 0 and < 1 samples the dataset; a value >= 1 is an absolute number of partitions.
load.metadata.mode - FILE by default. FILE and STREAM are the two accepted values; FILE is currently the only supported mode.
load.metadata.format - DSV by default. Supported file formats are:
- DSV: delimiter-separated values; the delimiter is set in the "separator" field.
- POSITION: fixed-width file where values sit at an exact position in each line.
- SIMPLE_JSON: JSON with top-level fields only, kept separate from deep JSON for performance.
- JSON: deep JSON; use only when your documents contain subdocuments, otherwise prefer SIMPLE_JSON, which is much faster.
- XML: XML files.
load.metadata.encoding - UTF-8 if not specified.
load.metadata.multiline - Are JSON objects on a single line or spread over multiple lines? Single-line (false) by default; single-line is also faster.
load.metadata.array - Is the JSON stored as a single array of objects? false by default, meaning one JSON document per line.
load.metadata.withHeader - Does the dataset have a header? true by default.
load.metadata.separator - The value delimiter, ';' by default. May be a multi-character string starting from Spark 3.
load.metadata.quote - The string quote character, '"' by default.
load.metadata.escape - The escape character, '\' by default.
load.metadata.write - Append to or overwrite existing data.
load.metadata.partition.attributes - List of attributes used for partitioning.
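A sketch of a metadata block for a delimiter-separated file with partitioning; the separator and partition columns are illustrative:

metadata:
  mode: "FILE"
  format: "DSV"
  encoding: "UTF-8"
  withHeader: true
  separator: "|"
  write: "APPEND"      # append to or overwrite existing data
  partition:
    sampling: 0.0      # no sampling
    attributes:
      - "year"
      - "month"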
.load.metadata.sink
load.metadata.sink.type - Where to sink the data.
load.metadata.sink.name - Once ingested, files may be written to BigQuery, Elasticsearch or any JDBC-compliant database.
load.metadata.sink.id - ES: attribute to use as the document id. Generated by Elasticsearch if not specified.
load.metadata.sink.timestamp - BQ: the timestamp column to use for table partitioning, if any; no partitioning by default. ES: timestamp field format as expected by Elasticsearch (for example "{beginTs|yyyy.MM.dd}").
load.metadata.sink.location - BQ: database location (EU, US, ...).
load.metadata.sink.clustering - BQ: ordered list of columns to use for table clustering.
load.metadata.sink.days - BQ: number of days before this table expires and is deleted. Never by default.
load.metadata.sink.requirePartitionFilter - BQ: should a partition filter be required on every query? No by default.
load.metadata.sink.connection - JDBC: connection string.
load.metadata.sink.partitions - JDBC: number of Spark partitions.
load.metadata.sink.batchSize - JDBC: batch size of each JDBC bulk insert.
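A sketch of a BigQuery sink under metadata; the partition column, clustering columns and expiry are illustrative:

metadata:
  sink:
    type: "BQ"
    timestamp: "order_date"       # partition column
    location: "EU"
    clustering:
      - "country"
      - "city"
    days: 90                      # expire and delete after 90 days
    requirePartitionFilter: true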
load.metadata.ignore - Pattern to ignore, or UDF to apply in order to skip some lines.
load.metadata.clustering - List of attributes to use for clustering.
.load.metadata.xml

load.comment - Domain description (free text).
load.ack - Ack extension used for each file; ".ack" if not specified. Files are moved to the pending folder only once a file with the same name as the source file and with this extension is present. To move a file without requiring an ack file, explicitly set this property to the empty string "".
load.schemaRefs - List of files containing the schemas. Each should start with an '_' and be located in the same folder.
load.schemas - List of schemas for each dataset in this domain. A domain usually contains multiple schemas, each defining how the contents of the input file should be parsed. Schema fields are detailed below.
load.extensions - Recognized filename extensions. json, csv, dsv and psv are recognized by default. Only files with these extensions are moved to the pending folder.
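Putting these fields together, a sketch of a domain file, assuming the load root shown by this schema; the domain, folder, schema names and pattern are illustrative:

load:
  name: "sales"
  directory: "/incoming/sales"
  ack: ""                         # do not wait for an ack file
  extensions:
    - "csv"
  schemas:
    - name: "orders"
      pattern: "orders-.*\\.csv"
      metadata:
        format: "DSV"
        withHeader: true
        separator: ";"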
.assertions

.views
Task fields:
sql - Main SQL request to execute (do not forget to prefix table names with the database name to avoid conflicts).
engine - SPARK or BQ. Default value is SPARK.
domain - Output domain in the output area (the database name in Hive, the dataset in BigQuery).
dataset - Dataset name in the output area (the table name in Hive and BigQuery).
write - Append to or overwrite existing data.
area - Target area where the domain / dataset will be stored.
partition - List of columns used for partitioning the output.
presql - List of SQL requests to execute before the main SQL request is run.
postsql - List of SQL requests to execute after the main SQL request is run.
.sink
sink.type - Where to sink the data.
sink.name - Once ingested, data may be written to BigQuery, Elasticsearch or any JDBC-compliant database.
sink.id - ES: attribute to use as the document id. Generated by Elasticsearch if not specified.
sink.timestamp - BQ: the timestamp column to use for table partitioning, if any; no partitioning by default. ES: timestamp field format as expected by Elasticsearch (for example "{beginTs|yyyy.MM.dd}").
sink.location - BQ: database location (EU, US, ...).
sink.clustering - BQ: ordered list of columns to use for table clustering.
sink.days - BQ: number of days before this table expires and is deleted. Never by default.
sink.requirePartitionFilter - BQ: should a partition filter be required on every query? No by default.
sink.connection - JDBC: connection string.
sink.partitions - JDBC: number of Spark partitions.
sink.batchSize - JDBC: batch size of each JDBC bulk insert.
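A sketch of a single task with presql and an Elasticsearch sink; the table names, id column and timestamp format are illustrative:

tasks:
  - sql: "SELECT * FROM sales.orders WHERE amount > 0"
    domain: "sales"
    dataset: "valid_orders"
    write: "APPEND"
    area: "business"
    presql:
      - "DELETE FROM sales.valid_orders WHERE load_date = CURRENT_DATE()"
    sink:
      type: "ES"
      id: "order_id"                    # document id
      timestamp: "{beginTs|yyyy.MM.dd}" # index timestamp format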
rls - Row level security rules attached to this task. Each rule defines: name, predicate (see the RLS fields below).
.assertions

Row level security fields:
name - Unique name of this row level security rule.
predicate - The condition that goes into the WHERE clause and limits the visible rows.
grants - Users / groups / service accounts to which this security level applies, e.g. user:me@mycompany.com,group:group@mycompany.com,serviceAccount:mysa@google-accounts.com
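A sketch of an RLS rule; the rule name, predicate and grants are illustrative:

rls:
  - name: "sales_fr_only"
    predicate: "country = 'FR'"   # goes into the WHERE clause
    grants:
      - "user:me@mycompany.com"
      - "group:sales@mycompany.com"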
Schema fields:
name - Schema name; must be unique among all the schemas belonging to the same domain. Becomes the Hive table name on premise or the BigQuery table name on GCP.
pattern - Filename pattern to which this schema applies. The framework uses this schema to parse any file whose name matches this pattern.
primaryKey - List of columns that make up the primary key.
attributes - Attribute parsing rules. Each attribute defines: name, type, foreignKey, array, required, privacy, comment, rename, metricType, position. Attribute fields are detailed below.
.metadata - Schema-level metadata block. Same fields and defaults as load.metadata above (partition, mode, format, encoding, multiline, array, withHeader, separator, quote, escape, write, sink, ignore, clustering, xml).
.merge
merge.delete - Optional valid SQL condition on the incoming dataset. Use renamed columns here.
merge.timestamp - Timestamp column used to identify the last version; if not specified, the currently ingested row is considered the last.
merge.queryFilter
comment - Free text.
merge.key - List of attributes used to join the existing dataset with the incoming one. Use renamed columns here.
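A sketch of a merge block that keeps the latest version of each row; the column names and delete condition are illustrative:

merge:
  key:
    - "order_id"                    # join key between existing and incoming rows
  timestamp: "updated_at"           # the latest value wins
  delete: "status = 'CANCELLED'"    # SQL condition evaluated on the incoming dataset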
presql - Reserved for future use.
postsql - Reserved for future use.
tags - Set of strings to attach to this schema.
rls - Experimental. Row level security rules applied to this schema. Each rule defines: name, predicate (see the RLS fields above).
.assertions
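Combining the schema fields above, a sketch of one schema entry; all names and the privacy strategy are illustrative:

schemas:
  - name: "customers"
    pattern: "customers-.*\\.psv"
    primaryKey:
      - "customer_id"
    metadata:
      format: "DSV"
      separator: "|"
      withHeader: true
    attributes:
      - name: "customer_id"
        type: "string"
        required: true
      - name: "email"
        type: "string"
        privacy: "SHA256"    # privacy transformation applied at ingestion time
    tags:
      - "pii"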
Attribute fields:
name - Attribute name as defined in the source dataset and as received in the file.
type - Semantic type of the attribute.
foreignKey - If this attribute is a foreign key, a reference of the form [domain.]table[.attribute].
array - Is this attribute an array?
required - Should this attribute always be present in the source?
privacy - Privacy transformation to apply to this attribute at ingestion time.
comment - Free text describing the attribute.
rename - If present, the attribute is renamed with this name.
metricType - If present, the kind of statistic to compute for this field.
attributes - List of sub-attributes (valid for JSON and XML files only). Sub-attributes carry the same fields as attributes.
.position
position.first - First character position of the value in the line (POSITION format).
position.last - Last character position of the value in the line (POSITION format).
default - Default value for this attribute when it is not present.
trim
script - Scripted field: SQL request on renamed columns.
tags - Tags associated with this attribute.
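A sketch of attributes for a fixed-width (POSITION) file, showing position, rename, default and a scripted field; the names, offsets and expression are illustrative:

attributes:
  - name: "cust_no"
    type: "string"
    required: true
    rename: "customer_id"     # exposed under the new name
    position:
      first: 0
      last: 9
  - name: "amount"
    type: "double"
    position:
      first: 10
      last: 19
    default: "0"              # used when the value is absent
  - name: "amount_eur"
    type: "double"
    script: "amount * 1.1"    # SQL request on renamed columns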