Import data

Import entities, relationships, signals and time series via the Exabel SDK

The Exabel SDK provides an easy way to programmatically import data in bulk, and is recommended for use in production data pipelines. Under the hood, it uses the Data API.

This page describes how to use the SDK on the command line to import entities, relationships, signals and time series. It assumes you have completed Setup (install the SDK and configure authentication).

📘

Requires SDK version ≥4.0

The instructions on this page require the Exabel SDK version ≥4.0.

Command line use

We recommend using the SDK through the command line — click here for a full list of scripts.

In general, the scripts may be run from the command line as follows:

python -m exabel.scripts.<script> <arguments>

The 3 main import scripts described below operate on CSV files and require the following arguments:

  • --api-key: your customer-level API key. Alternatively, set the EXABEL_API_KEY environment variable. See Setup → Authentication.
  • --filename: path to the CSV file to import

Optional arguments:

  • --help: print the help message with a full list of arguments
  • --sep: CSV delimiter (defaults to comma ,)
  • --dry-run: if True, prints to the command line without making any API calls
  • --threads: the number of parallel import threads (defaults to 40, but may be 1-100)

Technical concepts

You may want to recap Exabel's key concepts, particularly entities, relationships, signals and time series.

Resource name

Entities, relationships and signals are referenced by a resource name that uniquely identifies each resource. When importing these, the Exabel SDK requires you to provide unique names that will be used as resource names.

📘

Exabel SDK command-line scripts normalize imported names

The Exabel SDK command-line scripts "normalize" your imported names to match resource name conventions, by removing any spaces and invalid characters.

Valid resource names are:

  • Entity names can contain letters (upper/lowercase), numbers, hyphens & underscores; cannot start with a hyphen; and can be up to 64 characters long.
    Eg: skoda (regex: \w[\w-]{0,63})
  • Relationship names must start with an uppercase letter; can contain uppercase letters, numbers, and underscores; and can be up to 64 characters long.
    Eg: HAS_BRAND (regex: [A-Z][A-Z0-9_]{0,63})
  • Signal names must start with a letter (upper/lowercase); can contain letters, numbers, and underscores; and can be up to 64 characters long.
    Eg: sales_index (regex: [a-zA-Z]\w{0,63})

If your data files contain names with invalid characters, the command-line scripts will:

  1. Replace all invalid characters with an underscore (_)
  2. Shorten multiple sequential underscores to a single underscore
  3. Truncate names to be at most 64 characters

Note 1: if you are using the Python client from the Exabel SDK (not recommended, and not described on this page), or the Exabel Data API directly, then you will need do this normalization yourself.

Note 2: the automatic truncation of long resource names to 64 characters may result in multiple resources having the same name — we are unable to check for this in the command-line scripts. Exabel recommends that you normalize the names yourself if this is a possibility.

While the Exabel platform's canonical resource names include your namespace (eg customer1.skoda), when using most command-line scripts, the namespace is retrieved automatically from the Data API, and the namespace should not be provided within your data.

👍

Best practice: normalize resource names before data import

Entity resource names should be stable and consistent identifiers over time. If an entity changes display name, this should not result in a new entity being created in the data model. It is therefore important to be mindful when choosing a field as the entity column in your data source.

Normalizing your names into resource names before data import gives you full control and visibility into the actual identifiers used, and allowing you to ensure that the resource names are unique.

For example, see Importing entities — Example 2.

Display name & description

Entities and signals may optionally be imported with more human-readable display names.

Entities, signals and relationships may also optionally be imported with descriptions.

Both display names and descriptions are used in the Exabel user interface where possible.

Check your identifier coverage

Exabel supports a range of identifiers with which you may map your data to companies. Not all identifiers will map to a company and sometimes multiple identifiers map to the same company.

To identify such cases Exabel provides a script to check the validity of your identifiers:

python -m exabel.scripts.check_company_identifiers_in_csv \
        --api-key="my_api_key" \
        --filename="my_identifiers.csv" \
        --identifier-column="factset_identifier"

Arguments

  • --filename: a file containing a column with identifiers
  • --identifier-column: column for identifiers supported by Exabel

Create entity types

If you are importing entities that do not belong to the list of pre-defined global entity types, you will need to create the custom entity types first using create_entity_type. You will need to specify if the entity type is associative or not.

python -m exabel.scripts.create_entity_type \
    --api-key="my_api_key" \
    --name="entityTypes/my_namespace.my_entity_type" \
    --display-name="My Entity Type" \
    --description="This is my entity type" \
    [ --is-associative | --no-is-associative ]

Arguments

  • --name: the resource name of the new entity type
  • --display-name: the display name of the new entity type
  • --description: text description of the new entity type
  • --is-associative | --no-is-associative: whether the entity type is associative

Create relationship types

Create the relationship types in your model relationship types by running create_relationship_type. Decide whether the relationship should be an ownership relationship or not.

python -m exabel.scripts.create_relationship_type \
    --api-key="my_api_key" \
    --name="relationshipType/my_namespace.my_relationship_type" \
    --description="This is my relationship type" \
    [ --is-ownership | --no-is-ownership ]

Arguments

  • --name: the resource name of the new relationship type
  • --description: text description of the new relationship type
  • --is-ownership | --no-is-ownership: whether the relationship type is an ownership relationship

Importing entities

Import entities by running load_entities_from_csv.

Arguments

  • --entity-type: (optional) type of entities being imported; if unspecified, defaults to the value of entity_column
  • --entity-column: (optional) column for entity resource names; if unspecified, defaults to the first column
  • --display-name-column: (optional) column for entity display names; if unspecified, defaults to the second column if present, otherwise the resource name
  • --description-column: (optional) column for entity descriptions
  • --upsert: (optional, default False) if True, updates entities if they already exist

Example 1

The simplest example has a CSV of entities with a single column, where the column name (brand) corresponds to the entity type being imported, relying on the Exabel SDK to normalize brand names into resource names:

entity_id
Spring & Vine
The Coconut Tree
python -m exabel.scripts.load_entities_from_csv \
    --api-key="my_api_key" \
    --filename="entities.csv" \
    --entity-type="my_namespace.merchant"

Example 2

This example structures the CSV with 3 columns specifying the normalized resource name (id), display name (display_name), and description (description) explicitly:

id,display_name,description
merchant_001,Spring & Vine,Shampoo bars
merchant_002,The Coconut Tree,Sri Lankan street food
python -m exabel.scripts.load_entities_from_csv \
    --api-key="my_api_key" \
    --filename="entities.csv" \
    --entity-type="my_namespace.merchant" \
    --description-column="brand_description"

Importing relationships

Import relationships by running load_relationships_from_csv.

Arguments

  • --relationship-type: type of relationships being imported
  • --from-entity-column: (optional) column for the entity from which the relationship originates; defaults to the first column
  • --from-entity-type: (optional) entity type of the from entity; if unspecified, inferred from the column name
  • --from-identifier-type: (optional) the identifier type used to map the from entity. Required when --from-entity-type is set and you are addressing entities by identifier.
  • --to-entity-column: (optional) column for the entity to which the relationship goes; defaults to the second column
  • --to-entity-type: (optional) entity type of the to entity; if unspecified, inferred from the column name
  • --to-identifier-type: (optional) the identifier type used to map the to entity. Required when --to-entity-type is set and you are addressing entities by identifier.
  • --description-column: (optional) column for relationship descriptions
  • --upsert: (optional, default False) if True, updates relationships if they already exist

Example 1: link to companies

A common use case is to create relationships that link your own custom entities to pre-loaded company entities (from Factset) available on the Exabel platform. (For more detail — see Mapping to Companies & Securities.)

You may reference pre-loaded companies by using these identifiers as column headers in your CSV:

  • bloomberg_ticker or bloomberg_symbol (eg AAPL US or AAPL US Equity for Apple Inc.)
  • mic:ticker (eg XNAS:AAPL for Apple Inc.)
  • figi (eg BBG001S5N8V8 for Apple Inc.)
  • isin (eg US0378331005 for Apple Inc.)
  • factset_identifier (eg 000C7F-E for Apple Inc.)

Exabel will map these identifiers to the correct company entity, and create the relationship between your entity and the identified company.

factset_identifier,merchant
00001-E,Spring & Vine
00002-E,The Coconut Tree
python -m exabel.scripts.load_relationships_from_csv \
    --api-key="my_api_key" \
    --filename="relationships.csv" \
    --relationship-type="my_namespace.HAS_MERCHANT" \
    --from-entity-type="company" \
    --from-identifier-type="factset_identifier" \
    --to-entity-type="my_namespace.merchant"
🚧

Relationships must reference the same entity resource names previously imported

If you are importing your own custom entities, it is important that you use the same entity resource names when importing both entities and relationships.

If you chose to import un-normalized entity names and let the Exabel SDK handle normalization (Entities example 1 above), you should use the same un-normalized entity names in your relationships data file.

Example 2: link your entities

Another common use case is to create relationships linking your own custom entities.

from,to
merchant_001,merchant_001-NL
merchant_002,merchant_002-TH
python -m exabel.scripts.load_relationships_from_csv \
    --api-key="my_api_key" \
    --filename="relationships.csv" \
    --relationship-type="my_namespace.LOCATED_IN" \
    --from-entity-type="my_namespace.merchant" \
    --to-entity-type="my_namespace.merchant_and_country"

Importing signals & time series

Import time series by running load_time_series_from_file. Each imported time series must be associated with a signal.

You can ask the script to create signals for you using the --create-missing-signals argument (recommended), or create them yourself with the Exabel SDK Python client or the Data API.

Arguments

  • --entity-type: (optional) entity type of the entities; if unspecified, inferred from the entity column name
  • --identifier-type: (optional) the identifier type used to map the entities. Required when --entity-type is set and you are addressing entities by identifier.
  • --create-missing-signals: (optional, default False) if True, automatically creates signals that do not already exist
  • --pit-current-time: (optional, default False) if True, sets the known time for all imported data points to be the time at which they are inserted into the Exabel platform
  • --pit-offset: (optional) sets the known time to an offset of 0-30 days, applied to the date of each data point.

Point-in-time data

When importing time series, you must specify the known time at which each data point was known. This is important to avoid look-ahead bias when analyzing data.

Learn more about Exabel's native point-in-time support.

There are 3 options for specifying known times:

  1. Example 1: provide data with an additional known_time column (recommended for historical backloads)
  2. Example 2: specify an offset of 0-30 days, applied to the date of each data point (also for historical backloads)
  3. Example 3: set the known time to be the import run time (recommended for live production pipelines)

Example 1: backload with specific known times

In this example, you provide a specific known time for every data point. This provides maximum flexibility to handle data which may have been revised over time at an irregular cadence.

entity_id,date,known_time,sales_actual,sales_estimate
merchant_001,2020-10-01,2020-11-02,100000,50000
merchant_002,2020-05-31,2020-06-02,,20000
  • Column 1: entity column, with entity resource names as values
  • Column 2: date column, with ISO-8601 dates (not datetimes). The column header must be date.
  • Column 3: known_time column, with ISO-8601 dates. The column header must be known_time.
  • Columns 4, …: one column for each signal that you import, with time series data for the corresponding entity in each row. Values may be blank if there is no data for that entity-date-signal combination, as in the example.
python -m exabel.scripts.load_time_series_from_file \
    --api-key="my_api_key" \
    --filename="time_series.csv" \
    --entity-type="my_namespace.merchant"

Example 2: backload with point-in-time offset

In this example, you provide only the date for each data point and a point-in-time offset, in units of days, to apply to each date. This is best suited to historical backloads where you may not have the historical known times.

For instance, if we specify an offset of 2 days, for Spring & Vine below, the sales_actual data point (date of 2020-10-01) will have a known time of 2020-10-03.

Structure your CSV as in Example 1, but without a column for known time:

entity_id,date,sales_actual,sales_estimate
merchant_001,2020-10-01,100000,50000
merchant_002,2020-05-31,,20000
python -m exabel.scripts.load_time_series_from_file \
    --api-key="my_api_key" \
    --filename="time_series.csv" \
    --entity-type="my_namespace.merchant" \
    --pit-offset=2

Example 3: live pipeline

For live production pipelines, we recommend setting the known time to be the import run time. Structure your CSV as in Example 2:

entity_id,date,sales_actual,sales_estimate
merchant_001,2020-10-01,100000,50000
merchant_002,2020-05-31,,20000
python -m exabel.scripts.load_time_series_from_file \
    --api-key="my_api_key" \
    --filename="time_series.csv" \
    --entity-type="my_namespace.merchant" \
    --pit-current-time

The known time that is set will be the time on the Exabel system when the data is inserted. (Your local clock does not affect this.)

👍

Best practices for live data pipelines

  • Use --pit-current-time.
  • Import only new or updated data points — this will make the import run much more quickly as there will be fewer data points to process.

If you are unable to filter out historical data points that are unchanged, this is ok — the Exabel platform accepts imports of full historical time series, even if many of the data points already exist.

Long-form vs wide-form

The examples above use wide-form data — multiple signals are arranged in columns, aligned to the entities and dates in each row. This is useful and more compact when your signal data is dense and signals share the same frequency, so all signals have data points for every entity-and-date combination.

If your signals have differing frequencies (eg weekly and monthly signals), or if you want to import null values to override values with an earlier known-time, use long-form instead.

👍

Long-form data is automatically detected

If your data is structured as per the example below, with entities in the first column, and date, known_time (optional), signal and value columns, the command-line script will process your file as long-form data.

The same data from example #1 above in long-form:

entity_id,date,known_time,signal,value
merchant_001,2020-10-01,2020-11-02,sales_actual,100000
merchant_001,2020-10-01,2020-11-02,sales_estimate,50000
merchant_002,2020-05-31,2020-06-02,sales_estimate,20000
  • Column 1 must still be the entity column
  • Column 2 must still be the date column
  • The signal and value columns are required
  • The known_time column shown here is optional

Importing null values

If your data contains null, unknown, missing or deleted values, these can be uploaded using long-form CSVs.

🚧

Differences in null handling: long-form vs wide-form

In long-form data, the Exabel command-line scripts accept null values, which may override existing values.

In wide-form data, the Exabel command-line scripts ignore null values in the CSV, and do not import these into Exabel.

📘

Example long-form CSV with null values

A sales estimate exists on May 1 as the mean of two brokers' estimates. On May 2 no brokers are reporting sales estimate for the company, represented as a blank value. On May 3 a valid sales estimate reappears.

entity_id,date,known_time,signal,value
merchant_002,2020-05-31,2020-05-01,consensus_sales_estimate,20000
merchant_002,2020-05-31,2020-05-02,consensus_sales_estimate,
merchant_002,2020-05-31,2020-05-03,consensus_sales_estimate,19000
merchant_002,2020-05-31,2020-05-01,number_of_brokers,2
merchant_002,2020-05-31,2020-05-02,number_of_brokers,0
merchant_002,2020-05-31,2020-05-03,number_of_brokers,1

Importing time series on the global entity

It is also possible to connect time series to the "global" entity when uploading time series. No entity column is necessary in the input file but you must specify --entity-type="global".

date;global_signal
2021-01-01;1
2021-01-02;2
python -m exabel.scripts.load_time_series_from_file \
    --api-key="my_api_key" \
    --filename="time_series.csv" \
    --entity-type="global" \
    --pit-current-time
👍

Best practice for global time series

Think carefully before loading time series to the global entity — are the raw signals truly global for your use case? If you find yourself loading multiple signals like sales_europe, sales_americas, sales_africa and so on, you should probably introduce a region entity type with regions (Europe, Americas, Africa) as entities, and load the time series to a sales signal on these entities instead. See data modelling for more examples.