Importing via Exabel SDK

Import data in production pipelines

The Exabel SDK provides an easy way to programmatically import data in bulk, and is recommended for use in production data pipelines. Under the hood, it uses the Data API.

This page describes how to use the Exabel SDK on the command line to import entities, relationships, signals and time series.

📘

Requires SDK version ≥4.0

The instructions on this page require use of the Exabel SDK version ≥4.0.

Installation

pip install exabel-data-sdk

The package is published on PyPI. The SDK requires Python 3.7 or later.

See the installation instructions on GitHub for other configuration options when installing the SDK.
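
To confirm which version you have installed (this page requires ≥4.0), one option is to query the package metadata; a minimal sketch (requires Python 3.8+ for importlib.metadata):

# Print the installed version of the exabel-data-sdk package (Python 3.8+)
from importlib.metadata import version

print(version("exabel-data-sdk"))  # expect 4.0 or later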

Command line use

We recommend using the SDK through the command line - click here for a full list of scripts.

In general, the scripts may be run from the command line as follows:

python -m exabel_data_sdk.scripts.<script> <arguments>

The 3 main import scripts described below operate on CSV files, and require the following arguments:

  • --api-key: your customer-level API key. Alternatively, set the EXABEL_API_KEY environment variable.
  • --filename: path to the CSV file to import

The following are optional arguments:

  • --help: print the help message with a full list of arguments
  • --sep: CSV delimiter (defaults to comma ,)
  • --dry-run: if True, prints to the command line without making any API calls
  • --threads: the number of parallel import threads (defaults to 40, but may be 1-100)
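
If you orchestrate imports from Python rather than a shell (for example in a scheduled production job), the scripts can be invoked as subprocesses. A minimal sketch using only the standard library; the script name, file name and entity type are taken from the examples later on this page:

import os
import subprocess
import sys

# Pass the API key via the environment instead of the --api-key argument
env = dict(os.environ, EXABEL_API_KEY="my_api_key")

# Invoke one of the import scripts described on this page
subprocess.run(
    [
        sys.executable, "-m", "exabel_data_sdk.scripts.load_entities_from_csv",
        "--filename", "entities.csv",
        "--entity-type", "my_namespace.merchant",
    ],
    env=env,
    check=True,  # raise an exception if the import fails
)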

Technical concepts

You may want to recap Exabel's key concepts, particularly entities, relationships, signals and time series.

Resource name

Entities, relationships and signals are referenced by a resource name that uniquely identifies each resource. When importing these, the Exabel SDK requires you to provide unique names that will be used as resource names.

📘

Exabel SDK command-line scripts normalize imported names

The Exabel SDK command-line scripts "normalize" your imported names to match resource name conventions, by removing any spaces and invalid characters.

Valid resource names are:

  • Entity names can contain letters (upper/lowercase), numbers, hyphens & underscores; cannot start with a hyphen; and can be up to 64 characters long.
    Eg: skoda (regex: \w[\w-]{0,63})
  • Relationship names must start with an uppercase letter; can contain uppercase letters, numbers, and underscores; and can be up to 64 characters long.
    Eg: HAS_BRAND (regex: [A-Z][A-Z0-9_]{0,63})
  • Signal names must start with a letter (upper/lowercase); can contain letters, numbers, and underscores; and can be up to 64 characters long.
    Eg: sales_index (regex: [a-zA-Z]\w{0,63})

If your data files contain names with invalid characters, the command-line scripts will:

  1. Replace all invalid characters with an underscore (_)
  2. Shorten multiple sequential underscores to a single underscore
  3. Truncate names to be at most 64 characters

Note 1: if you are using the Python client from the Exabel SDK (not recommended, and not described on this page), or the Exabel Data API directly, then you will need to do this normalization yourself.

Note 2: the automatic truncation of long resource names to 64 characters may result in multiple resources having the same name - we are unable to check for this in the command-line scripts. Exabel recommends that you normalize the names yourself if this is a possibility.
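
If you normalize names yourself, a minimal Python sketch of the rules above (illustrative only; this is not the SDK's internal implementation):

import re

def normalize_resource_name(name: str) -> str:
    # 1. Replace invalid characters (anything outside letters, numbers,
    #    underscores and hyphens) with an underscore
    name = re.sub(r"[^\w-]", "_", name)
    # 2. Collapse sequences of underscores into one
    name = re.sub(r"_+", "_", name)
    # Entity names cannot start with a hyphen
    name = name.lstrip("-")
    # 3. Truncate to at most 64 characters
    return name[:64]

print(normalize_resource_name("Spring & Vine"))  # Spring_Vine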

The Exabel platform's canonical resource names include your namespace (eg customer1.skoda). When using most command-line scripts, however, the namespace is retrieved automatically from the Data API and should not be included in your data.

👍

Best practice: normalize resource names before data import

Entity resource names should be stable and consistent identifiers over time. If an entity changes display name, this should not result in a new entity being created in the data model. It is therefore important to be mindful when choosing a field as the entity column in your data source.

Normalizing your names into resource names before data import gives you full control and visibility into the actual identifiers used, and allows you to ensure that the resource names are unique.

For example, see Importing entities - Example 2

Display name & description

Entities and signals may optionally be imported with more human-readable display names.

Entities, signals and relationships may also optionally be imported with descriptions.

Both display names and descriptions are used in the Exabel user interface where possible.

Check your identifier coverage

Exabel supports a range of identifiers with which you may map your data to companies. Not all identifiers will map to a company and sometimes multiple identifiers map to the same company.

To identify such cases, Exabel provides a script to check the validity of your identifiers.

Run the following:

python -m exabel_data_sdk.scripts.check_company_identifiers_in_csv \
	--api-key="my_api_key" \
	--filename="my_identifiers.csv" \
	--identifier-column="factset_identifier"

Arguments

  • --filename: a file containing a column with identifiers
  • --identifier-column: column for identifiers supported by Exabel

Create entity types

If you are importing entities that do not belong to the list of pre-defined global entity types, you will need to create the custom entity types first. You can do this by running the create_entity_type command line script in the Exabel SDK. You will need to specify whether the entity type is associative.

Run the following:

python -m exabel_data_sdk.scripts.create_entity_type \
	--api-key="my_api_key" \
	--name="entityTypes/my_namespace.my_entity_type" \
	--display-name="My Entity Type" \
	--description="This is my entity type" \
	[ --is-associative | --no-is-associative ]

Arguments

  • --name: the resource name of the new entity type
  • --display-name: the display name of the new entity type
  • --description: text description of the new entity type
  • --is-associative | --no-is-associative: whether the entity type is associative or not

Create relationship types

You will need to create the relationship types in your data model by running the create_relationship_type command line script in the Exabel SDK. You will need to decide whether the relationship type should be an ownership relationship or not (Ownership relationships).

Run the following:

python -m exabel_data_sdk.scripts.create_relationship_type \
	--api-key="my_api_key" \
	--name="relationshipType/my_namespace.my_relationship_type" \
	--description="This is my relationship type" \
	[ --is-ownership | --no-is-ownership ]

Arguments

  • --name: the resource name of the new relationship type
  • --description: text description of the new relationship type
  • --is-ownership | --no-is-ownership: whether the relationship type is an ownership relationship or not

Importing entities

Import entities by running the load_entities_from_csv command line script in the Exabel SDK.

Arguments

  • --entity-type: (optional) type of entities being imported; if unspecified, defaults to the name of the entity name column (the value of --name-column)
  • --name-column: (optional) column for entity resource names; if unspecified, defaults to the first column
  • --display-name-column: (optional) column for entity display names; if unspecified, defaults to the second column if present, otherwise the resource name
  • --description-column: (optional) column for entity descriptions
  • --upsert: (optional, default False) if True, updates entities if they already exist

Example 1

The simplest example has a CSV of entities with a single column of entity names, with the entity type given by the --entity-type argument, relying on the Exabel SDK to normalize the names into resource names:

entity_id
Spring & Vine
The Coconut Tree

Run the following:

python -m exabel_data_sdk.scripts.load_entities_from_csv \
	--api-key="my_api_key" \
	--filename="entities.csv" \
	--entity-type="my_namespace.merchant"

Example 2

This example structures the CSV with 3 columns specifying the normalized resource name (id), display name (display_name), and description (description) explicitly.

id,display_name,description
merchant_001,Spring & Vine,Shampoo bars
merchant_002,The Coconut Tree,Sri Lankan street food

Run the following:

python -m exabel_data_sdk.scripts.load_entities_from_csv \
	--api-key="my_api_key" \
	--filename="entities.csv" \
	--entity-type="my_namespace.merchant" \
	--description-column="description"

Importing relationships

Import relationships by running the load_relationships_from_csv command line script in the Exabel SDK.

Arguments

  • --relationship-type: type of relationships being imported
  • --from-entity-column: (optional) column for the entity from which the relationship originates; if unspecified, defaults to the first column in the file
  • --from-entity-type: (optional) entity type of the entity from which the relationship originates; if unspecified, the entity type will be inferred from the column name
  • --from-identifier-type: (optional) the identifier type used to map the entity from which the relationship originates. If --from-entity-type is specified, this argument must be set in order to look up entities by an identifier
  • --to-entity-column: (optional) column for the entity to which the relationship goes; if unspecified, defaults to the second column in the file
  • --to-entity-type: (optional) entity type of the entity to which the relationship goes; if unspecified, the entity type will be inferred from the column name
  • --to-identifier-type: (optional) the identifier type used to map the entity to which the relationship goes. If --to-entity-type is specified, this argument must be set in order to look up entities by an identifier
  • --description-column: (optional) column for relationship descriptions
  • --upsert: (optional, default False) if True, updates relationships if they already exist

Example 1: link to companies

A common use case is to create relationships that link your own custom entities to the pre-loaded company entities (from FactSet) available on the Exabel platform. (For more detail, see Mapping to Companies & Securities.)

You may reference pre-loaded companies by using these identifiers as column headers in your CSV:

  • bloomberg_ticker or bloomberg_symbol (eg AAPL US or AAPL US Equity for Apple Inc.)
  • mic:ticker (eg XNAS:AAPL for Apple Inc.)
  • figi (eg BBG001S5N8V8 for Apple Inc.)
  • isin (eg US0378331005 for Apple Inc.)
  • factset_identifier (eg 000C7F-E for Apple Inc.)

Exabel will map these identifiers to the correct company entity, and create the relationship between your entity and the identified company.

This example uses FactSet identifiers to identify the company entities. Structure your CSV with 2 columns and 1 row of data for each relationship, using the same entity names as in the entity import:

factset_identifier,merchant
00001-E,Spring & Vine
00002-E,The Coconut Tree

The relationship type and direction of the relationship are specified through command line arguments to the import script, which can be run as follows:

python -m exabel_data_sdk.scripts.load_relationships_from_csv \
	--api-key="my_api_key" \
	--filename="relationships.csv" \
	--relationship-type="my_namespace.HAS_MERCHANT" \
	--from-entity-type="company" \
	--from-identifier-type="factset_identifier" \
	--to-entity-type="my_namespace.merchant"

🚧

Relationships must reference the same entity resource names previously imported

If you are importing your own custom entities, it is important that you use the same entity resource names when importing both entities and relationships.

If you choose to import un-normalized entity names and let the Exabel SDK handle normalization (Entities example 1 above), you should use the same un-normalized entity names in your relationships data file.

Example 2: link your entities

Another common use case is to create relationships linking your own custom entities. The process is similar - structure your CSV with 2 columns for the "from" and "to" entities:

from,to
merchant_001,merchant_001-NL
merchant_002,merchant_002-TH

And then run the following:

python -m exabel_data_sdk.scripts.load_relationships_from_csv \
	--api-key="my_api_key" \
	--filename="relationships.csv" \
	--relationship-type="my_namespace.LOCATED_IN" \
	--from-entity-type="my_namespace.merchant" \
	--to-entity-type="my_namespace.merchant_and_country"

Importing signals & time series

Import time series by running the load_time_series_from_file command line script in the Exabel SDK. In your data, you must specify which signal each imported time series belongs to.

You can ask the command line script to create signals for you using the --create-missing-signals argument (recommended), or create these signals yourself (with the Exabel SDK Python client, or via the Data API).

Arguments

  • --entity-type: (optional) entity type of the entities; if unspecified, the entity type will be inferred from the entity column name
  • --identifier-type: (optional) the identifier type used to map the entities. If --entity-type is specified, this argument must be set in order to look up entities by an identifier
  • --create-missing-signals: (optional, default False) if True, automatically creates signals that do not already exist
  • --pit-current-time: (optional, default False) if True, sets the known time for all imported data points to be the time at which it is inserted into the Exabel platform
  • --pit-offset: (optional) sets the known time to an offset of 0-30 days, applied to the date of each data point.

Point-in-time data

When importing time series, you must specify the known time for each data point, i.e., the time at which the value became known. This is used to avoid look-ahead bias when analyzing data.

Learn more about Exabel's native point-in-time support here.

There are 3 options for specifying known times:

  1. Example 1: Provide data with an additional known_time column (recommended for historical backloads)
  2. Example 2: Specify an offset of 0-30 days, applied to the date of each data point (recommended for historical backloads)
  3. Example 3: Set the known time to be the import run time (recommended for live production pipelines)

Example 1: backload with specific known times

In this example, you provide a specific known time for every data point. This provides maximum flexibility to handle data which may have been revised over time at an irregular cadence.

Structure your CSV as follows:

entity_id,date,known_time,sales_actual,sales_estimate
merchant_001,2020-10-01,2020-11-02,100000,50000
merchant_002,2020-05-31,2020-06-02,,20000

  • Column 1: entity column, with entity resource names as values
  • Column 2: date column, with ISO-8601 dates (not datetimes). The column header must be date.
  • Column 3: known_time column, with ISO-8601 dates. The column header must be known_time.
  • Columns 4, ...: one column for each signal that you import, with time series data for the corresponding entity in each row. Values may be blank if there is no data for that entity-date-signal combination, as in the example.
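
If you assemble this file programmatically, take care to serialize dates as ISO-8601 and leave missing values blank. A minimal pandas sketch using the data from this example (pandas is an assumption here, not a requirement of the SDK):

import pandas as pd

df = pd.DataFrame(
    {
        "entity_id": ["merchant_001", "merchant_002"],
        "date": pd.to_datetime(["2020-10-01", "2020-05-31"]),
        "known_time": pd.to_datetime(["2020-11-02", "2020-06-02"]),
        "sales_actual": [100000, None],  # None becomes a blank cell
        "sales_estimate": [50000, 20000],
    }
)

# Serialize datetime columns as ISO-8601 dates (not datetimes)
df.to_csv("time_series.csv", index=False, date_format="%Y-%m-%d")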

Then run the following:

python -m exabel_data_sdk.scripts.load_time_series_from_file \
	--api-key="my_api_key" \
	--filename="time_series.csv" \
	--entity-type="my_namespace.merchant"

Example 2: backload with point-in-time offset

In this example, you provide only the date for each data point, plus a point-in-time offset in days to apply to each date. This is best suited to historical backloads where you do not have exact historical known times, but know the typical publication lag.

For instance, if we specify an offset of 2 days, the sales_actual data point for Spring & Vine below (date 2020-10-01) will have a known time of 2020-10-03.

Structure your CSV as in Example 1, but without a column for known time:

entity_id,date,sales_actual,sales_estimate
merchant_001,2020-10-01,100000,50000
merchant_002,2020-05-31,,20000

Then run the following to specify an offset of 2 days:

python -m exabel_data_sdk.scripts.load_time_series_from_file \
	--api-key="my_api_key" \
	--filename="time_series.csv" \
	--entity-type="my_namespace.merchant" \
	--pit-offset=2

Example 3: live pipeline

For live production pipelines, we recommend setting the known time to be the import run time. Structure your CSV as in Example 2:

entity_id,date,sales_actual,sales_estimate
merchant_001,2020-10-01,100000,50000
merchant_002,2020-05-31,,20000

Then run the following:

python -m exabel_data_sdk.scripts.load_time_series_from_file \
	--api-key="my_api_key" \
	--filename="time_series.csv" \
	--entity-type="my_namespace.merchant" \
	--pit-current-time

The known time that is set will be the time on the Exabel system when the data is inserted into our database. (Your system time will not affect this.)

👍

Best practices for live data pipelines

  • Use --pit-current-time
  • Import only new or updated data points - this will make the import run much more quickly as there will be fewer data points to process.

If you are unable to filter out historical data points that are unchanged, this is ok - the Exabel platform accepts imports of full historical time series, even if many of the data points already exist.
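
One way to keep live imports small is to filter your export to data points that are new since the previous run. A minimal pandas sketch, assuming a hypothetical last_import_date value that your pipeline persists between runs:

import pandas as pd

# Hypothetical: the date of the last successful import, persisted by your pipeline
last_import_date = pd.Timestamp("2020-09-30")

df = pd.read_csv("time_series.csv", parse_dates=["date"])

# Keep only data points that are new since the previous run
new_points = df[df["date"] > last_import_date]
new_points.to_csv("time_series_delta.csv", index=False, date_format="%Y-%m-%d")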

Long-form vs wide-form

The examples above use data formatted in wide-form - multiple signals are arranged in columns, aligned to the entities and dates in each row. This is more compact, and works well when your signal data is dense and the signals share the same frequency, such that every signal has a data point for each entity-and-date combination.

If your signals have differing frequencies (eg weekly and monthly signals), or if you want to import null values to override values with an earlier known-time, we recommend that you import data in long-form.

👍

Long-form data is automatically detected

If your data is structured as per the example below, with entities in the first column, and date, known_time (optional), signal and value columns, the command-line script will process your file as long-form data.

The same data from example #1 above should be structured like this:

entity_id,date,known_time,signal,value
merchant_001,2020-10-01,2020-11-02,sales_actual,100000
merchant_001,2020-10-01,2020-11-02,sales_estimate,50000
merchant_002,2020-05-31,2020-06-02,sales_estimate,20000

  • Column 1 must still be the entity column
  • Column 2 must still be the date column
  • The signal and value columns are required
  • The known_time column shown here is optional
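
If your data is already in wide-form, converting it to this long-form layout is straightforward. A minimal pandas sketch using the wide-form file from Example 1 (pandas is an assumption, not a requirement of the SDK):

import pandas as pd

wide = pd.read_csv("time_series.csv")

# Unpivot the signal columns into signal/value pairs
long = wide.melt(
    id_vars=["entity_id", "date", "known_time"],
    var_name="signal",
    value_name="value",
)

# Drop rows without a value, unless you intend to import nulls (see below)
long = long.dropna(subset=["value"])
long.to_csv("time_series_long.csv", index=False)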

Importing null values

If your data contains null, unknown, missing or deleted values, these can be uploaded using long-form CSVs.

🚧

Differences in null handling: long-form vs wide-form

When handling long-form data, the Exabel command-line scripts accept null values, which may override existing values.

When handling wide-form data, the Exabel command-line scripts ignore null values in the CSV, and do not import them into Exabel.

📘

Example long-form CSV with null values

In the example file below, there is a sales estimate on May 1, calculated as the mean of two brokers' estimates. On May 2, however, no brokers are reporting a sales estimate for the company, so no valid sales estimate exists on that day; this is represented as a blank value in the CSV. On May 3, there is again a valid sales estimate, and a value reappears. See Point-in-time data for details about the use of known time in this scenario.

entity_id,date,known_time,signal,value
merchant_002,2020-05-31,2020-05-01,consensus_sales_estimate,20000
merchant_002,2020-05-31,2020-05-02,consensus_sales_estimate,
merchant_002,2020-05-31,2020-05-03,consensus_sales_estimate,19000
merchant_002,2020-05-31,2020-05-01,number_of_brokers,2
merchant_002,2020-05-31,2020-05-02,number_of_brokers,0
merchant_002,2020-05-31,2020-05-03,number_of_brokers,1

Importing time series on the global entity

It is also possible to connect time series to the "global" entity when uploading time series. No entity column is needed in the input file, but you must specify the argument --entity-type="global". Note that this example file uses semicolons as the separator, which requires the --sep argument. Example input file:

date;global_signal
2021-01-01;1
2021-01-02;2

Then run the following:

python -m exabel_data_sdk.scripts.load_time_series_from_file \
	--api-key="my_api_key" \
	--filename="time_series.csv" \
	--sep=";" \
	--entity-type="global" \
	--pit-current-time

👍

Best practice for global time series

Think carefully before loading time series to the global entity: are the raw signals truly global for your use case? If you find yourself loading multiple signals like sales_europe, sales_americas, sales_africa and so on, you should probably introduce a region entity type with regions (Europe, Americas, Africa) as entities, and load the time series to a single sales signal on those entities instead. See data modelling for more examples.