Importing via Exabel SDK
Import data in production pipelines
The Exabel SDK provides an easy way to programmatically import data in bulk, and is recommended for use in production data pipelines. Under the hood, it uses the Data API.
This page describes how to use the Exabel SDK on the command line to import entities, relationships, signals and time series.
Requires SDK version ≥4.0
The instructions on this page require use of the Exabel SDK version ≥4.0.
Installation
pip install exabel-data-sdk
Or download from PyPI. The SDK requires Python 3.7 or later.
See the installation instructions on GitHub for other configuration options when installing the SDK.
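For example, to install or upgrade to a version that satisfies the ≥4.0 requirement and confirm the installed version (standard pip commands):
pip install --upgrade exabel-data-sdk
pip show exabel-data-sdk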
Command line use
We recommend using the SDK through the command line; a full list of available scripts can be found here.
In general, the scripts may be run from the command line as follows:
python -m exabel_data_sdk.scripts.<script> <arguments>
The 3 main import scripts described below operate on CSV files, and require the following arguments:
- `--api-key`: your customer-level API key. Alternatively, set the `EXABEL_API_KEY` environment variable.
- `--filename`: path to the CSV file to import
The following are optional arguments:
- `--help`: print the help message with a full list of arguments
- `--sep`: CSV delimiter (defaults to comma `,`)
- `--dry-run`: if True, prints to the command line without making any API calls
- `--threads`: the number of parallel import threads (defaults to 40, but may be 1-100)
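For example, to preview an entity import without making any API calls, with the API key supplied via the environment variable (a sketch using the `merchant` entity type from the examples later on this page):
export EXABEL_API_KEY="my_api_key"
python -m exabel_data_sdk.scripts.load_entities_from_csv \
--filename="entities.csv" \
--entity-type="my_namespace.merchant" \
--dry-run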
Technical concepts
You may want to recap Exabel's key concepts, particularly entities, relationships, signals and time series.
Resource name
Entities, relationships and signals are referenced by a resource name that uniquely identifies each resource. When importing these, the Exabel SDK requires you to provide unique names that will be used as resource names.
Exabel SDK command-line scripts normalize imported names
The Exabel SDK command-line scripts "normalize" your imported names to match resource name conventions, by removing any spaces and invalid characters.
Valid resource names are:
- Entity names can contain letters (upper/lowercase), numbers, hyphens & underscores; cannot start with a hyphen; and can be up to 64 characters long. Eg: `skoda` (regex: `\w[\w-]{0,63}`)
- Relationship names must start with an uppercase letter; can contain uppercase letters, numbers, and underscores; and can be up to 64 characters long. Eg: `HAS_BRAND` (regex: `[A-Z][A-Z0-9_]{0,63}`)
- Signal names must start with a letter (upper/lowercase); can contain letters, numbers, and underscores; and can be up to 64 characters long. Eg: `sales_index` (regex: `[a-zA-Z]\w{0,63}`)
If your data files contain names with invalid characters, the command-line scripts will:
- Replace all invalid characters with an underscore (`_`)
- Shorten multiple sequential underscores to a single underscore
- Truncate names to be at most 64 characters
Note 1: if you are using the Python client from the Exabel SDK (not recommended, and not described on this page), or the Exabel Data API directly, then you will need to do this normalization yourself.
Note 2: the automatic truncation of long resource names to 64 characters may result in multiple resources having the same name - we are unable to check for this in the command-line scripts. Exabel recommends that you normalize the names yourself if this is a possibility.
While the Exabel platform's canonical resource names include your namespace (eg `customer1.skoda`), most command-line scripts retrieve the namespace automatically from the Data API, so the namespace should not be provided within your data.
Best practice: normalize resource names before data import
Entity resource names should be stable and consistent identifiers over time. If an entity changes display name, this should not result in a new entity being created in the data model. It is therefore important to be mindful when choosing a field as the entity column in your data source.
Normalizing your names into resource names before data import gives you full control and visibility into the actual identifiers used, and allows you to ensure that the resource names are unique.
For example, see Importing entities - Example 2
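If you choose to normalize names yourself, the snippet below is a minimal Python sketch of the rules described above (replace invalid characters, collapse repeated underscores, truncate to 64 characters). It is an illustration, not the SDK's own implementation, and does not handle every edge case (eg entity names starting with a hyphen):
import re

def normalize_resource_name(name: str) -> str:
    # Replace characters that are not letters, digits, underscores or hyphens
    name = re.sub(r"[^A-Za-z0-9_-]", "_", name)
    # Shorten multiple sequential underscores to a single underscore
    name = re.sub(r"_+", "_", name)
    # Truncate to at most 64 characters
    return name[:64]

print(normalize_resource_name("Spring & Vine"))  # prints: Spring_Vine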
Display name & description
Entities and signals may optionally be imported with more human-readable display names.
Entities, signals and relationships may also optionally be imported with descriptions.
Both display names and descriptions are used in the Exabel user interface where possible.
Check your identifier coverage
Exabel supports a range of identifiers with which you may map your data to companies. Not all identifiers will map to a company, and sometimes multiple identifiers map to the same company.
To identify such cases, Exabel provides a script to check the validity of your identifiers.
Run the following:
python -m exabel_data_sdk.scripts.check_company_identifiers_in_csv \
--api-key="my_api_key"
--filename="my_identifiers.csv" \
--identifier-column="factset_identifier"
Arguments
- `--filename`: a file containing a column with identifiers
- `--identifier-column`: the column containing identifiers supported by Exabel
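For example, `my_identifiers.csv` might contain a single identifier column, using the fictional FactSet identifiers from the relationship examples below:
factset_identifier
00001-E
00002-E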
Create entity types
If you are importing entities that do not belong to the list of pre-defined global entity types, you will need to create the custom entity types first. You can do this by running the `create_entity_type` command-line script in the Exabel SDK. You will need to specify whether the entity type is associative or not.
Run the following:
python -m exabel_data_sdk.scripts.create_entity_type \
--api-key="my_api_key" \
--name="entityTypes/my_namespace.my_entity_type" \
--display-name="My Entity Type" \
--description="This is my entity type" \
[ --is-associative | --no-is-associative ]
Arguments
- `--name`: the resource name of the new entity type
- `--display-name`: the display name of the new entity type
- `--description`: text description of the new entity type
- `--is-associative | --no-is-associative`: whether or not the entity type is associative
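For example, to create the `merchant` entity type used in the import examples below (the display name and description here are illustrative):
python -m exabel_data_sdk.scripts.create_entity_type \
--api-key="my_api_key" \
--name="entityTypes/my_namespace.merchant" \
--display-name="Merchant" \
--description="Merchants in my transaction data" \
--no-is-associative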
Create relationship types
You will need to create the relationship types in your data model by running the `create_relationship_type` command-line script in the Exabel SDK. You will need to decide whether the relationship type should be an ownership relationship or not (see Ownership relationships).
Run the following:
python -m exabel_data_sdk.scripts.create_relationship_type \
--api-key="my_api_key" \
--name="relationshipType/my_namespace.my_relationship_type" \
--description="This is my relationship type" \
[ --is-ownership | --no-is-ownership ]
Arguments
- `--name`: the resource name of the new relationship type
- `--description`: text description of the new relationship type
- `--is-ownership | --no-is-ownership`: whether or not the relationship type is an ownership relationship
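For example, to create the `HAS_MERCHANT` relationship type used in the examples below (the description here is illustrative):
python -m exabel_data_sdk.scripts.create_relationship_type \
--api-key="my_api_key" \
--name="relationshipTypes/my_namespace.HAS_MERCHANT" \
--description="Connects a company to one of its merchants" \
--no-is-ownership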
Importing entities
Import entities by running the `load_entities_from_csv` command-line script in the Exabel SDK.
Arguments
- `--entity-type`: (optional) type of entities being imported; if unspecified, defaults to the name of the column given by `--name-column`
- `--name-column`: (optional) column for entity resource names; if unspecified, defaults to the first column
- `--display-name-column`: (optional) column for entity display names; if unspecified, defaults to the second column if present, otherwise the resource name
- `--description-column`: (optional) column for entity descriptions
- `--upsert`: (optional, default False) if True, updates entities if they already exist
Example 1
The simplest example has a CSV of entities with a single column, relying on the Exabel SDK to normalize the entity names into resource names (eg `Spring & Vine` becomes `Spring_Vine`):
entity_id
Spring & Vine
The Coconut Tree
Run the following:
python -m exabel_data_sdk.scripts.load_entities_from_csv \
--api-key="my_api_key" \
--filename="entities.csv" \
--entity-type="my_namespace.merchant"
Example 2
This example structures the CSV with 3 columns specifying the normalized resource name (`id`), display name (`display_name`), and description (`description`) explicitly.
id,display_name,description
merchant_001,Spring & Vine,Shampoo bars
merchant_002,The Coconut Tree,Sri Lankan street food
Run the following:
python -m exabel_data_sdk.scripts.load_entities_from_csv \
--api-key="my_api_key" \
--filename="entities.csv" \
--entity-type="my_namespace.merchant" \
--description-column="description"
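Should you later need to update display names or descriptions for entities that already exist, the same import can be re-run with the `--upsert` flag (a sketch using the documented arguments above):
python -m exabel_data_sdk.scripts.load_entities_from_csv \
--api-key="my_api_key" \
--filename="entities.csv" \
--entity-type="my_namespace.merchant" \
--description-column="description" \
--upsert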
Importing relationships
Import relationships by running the `load_relationships_from_csv` command-line script in the Exabel SDK.
Arguments
- `--relationship-type`: type of relationships being imported
- `--from-entity-column`: (optional) column for the entity from which the relationship originates; if unspecified, defaults to the first column in the file
- `--from-entity-type`: (optional) entity type of the entity from which the relationship originates; if unspecified, the entity type will be inferred from the column name
- `--from-identifier-type`: (optional) the identifier type used to map the entity from which the relationship originates. If `--from-entity-type` is specified, this argument must be set in order to look up entities by an identifier
- `--to-entity-column`: (optional) column for the entity to which the relationship goes; if unspecified, defaults to the second column in the file
- `--to-entity-type`: (optional) entity type of the entity to which the relationship goes; if unspecified, the entity type will be inferred from the column name
- `--to-identifier-type`: (optional) the identifier type used to map the entity to which the relationship goes. If `--to-entity-type` is specified, this argument must be set in order to look up entities by an identifier
- `--description-column`: (optional) column for relationship descriptions
- `--upsert`: (optional, default False) if True, updates relationships if they already exist
Example 1: link to companies
A common use case is to create relationships that link your own custom entities to pre-loaded company entities (from Factset) available on the Exabel platform. (For more detail - see Mapping to Companies & Securities.)
You may reference pre-loaded companies by using these identifiers as column headers in your CSV:
- `bloomberg_ticker` or `bloomberg_symbol` (eg `AAPL US` or `AAPL US Equity` for Apple Inc.)
- `mic:ticker` (eg `XNAS:AAPL` for Apple Inc.)
- `figi` (eg `BBG001S5N8V8` for Apple Inc.)
- `isin` (eg `US0378331005` for Apple Inc.)
- `factset_identifier` (eg `000C7F-E` for Apple Inc.)
Exabel will map these identifiers to the correct `company` entity, and create the relationship between your entity and the identified `company`.
This example uses Factset IDs to identify the company entities. Structure your CSV to include 2 columns, with 1 row of data for each relationship, and use the same entity names from the entity import:
factset_identifier,merchant
00001-E,Spring & Vine
00002-E,The Coconut Tree
The relationship type and direction of the relationship are specified through command line arguments to the import script, which can be run as follows:
python -m exabel_data_sdk.scripts.load_relationships_from_csv \
--api-key="my_api_key" \
--filename="relationships.csv" \
--relationship-type="my_namespace.HAS_MERCHANT" \
--from-entity-type="company" \
--from-identifier-type="factset_identifier" \
--to-entity-type="my_namespace.merchant"
Relationships must reference the same entity resource names previously imported
If you are importing your own custom entities, it is important that you use the same entity resource names when importing both entities and relationships.
If you chose to import un-normalized entity names and let the Exabel SDK handle normalization (Entities example 1 above), you should use the same un-normalized entity names in your relationships data file.
Example 2: link your entities
Another common use case is to create relationships linking your own custom entities. The process is similar - structure your CSV with 2 columns for the "from" and "to" entities:
from,to
merchant_001,merchant_001-NL
merchant_002,merchant_002-TH
And then run the following:
python -m exabel_data_sdk.scripts.load_relationships_from_csv \
--api-key="my_api_key" \
--filename="relationships.csv" \
--relationship-type="my_namespace.LOCATED_IN" \
--from-entity-type="my_namespace.merchant" \
--to-entity-type="my_namespace.merchant_and_country"
Importing signals & time series
Import time series by running the `load_time_series_from_file` command-line script in the Exabel SDK. In your data, you must specify which signal each imported time series belongs to.
You can ask the command-line script to create signals for you using the `--create-missing-signals` argument (recommended), or create these signals yourself (with the Exabel SDK Python client, or via the Data API).
Arguments
- `--entity-type`: (optional) entity type of the entities; if unspecified, the entity type will be inferred from the entity column name
- `--identifier-type`: (optional) the identifier type used to map the entities. If `--entity-type` is specified, this argument must be set in order to look up entities by an identifier
- `--create-missing-signals`: (optional, default False) if True, automatically creates signals that do not already exist
- `--pit-current-time`: (optional, default False) if True, sets the known time for all imported data points to be the time at which they are inserted into the Exabel platform
- `--pit-offset`: (optional) sets the known time to an offset of 0-30 days, applied to the date of each data point
Point-in-time data
When importing time series, you must specify the known time at which each data point was known. This is important and used to avoid look-ahead bias when analyzing data.
Learn more about Exabel's native point-in-time support here.
There are 3 options for specifying known times:
- Example 1: Provide data with an additional `known_time` column (recommended for historical backloads)
- Example 2: Specify an offset of 0-30 days, applied to the date of each data point (recommended for historical backloads)
- Example 3: Set the known time to be the import run time (recommended for live production pipelines)
Example 1: backload with specific known times
In this example, you provide a specific known time for every data point. This provides maximum flexibility to handle data which may have been revised over time at an irregular cadence.
Structure your CSV as follows:
entity_id,date,known_time,sales_actual,sales_estimate
merchant_001,2020-10-01,2020-11-02,100000,50000
merchant_002,2020-05-31,2020-06-02,,20000
- Column 1: entity column, with entity resource names as values
- Column 2: `date` column, with ISO-8601 dates (not datetimes). The column header must be `date`.
- Column 3: `known_time` column, with ISO-8601 dates. The column header must be `known_time`.
- Columns 4, ...: one column for each signal that you import, with time series data for the corresponding entity in each row. Values may be blank if there is no data for that entity-date-signal combination, as in the example.
Then run the following:
python -m exabel_data_sdk.scripts.load_time_series_from_file \
--api-key="my_api_key" \
--filename="time_series.csv" \
--entity-type="my_namespace.merchant"
Example 2: backload with point-in-time offset
In this example, you provide only the `date` for each data point, together with a point-in-time offset, in units of days, to apply to each date. This is best suited to historical backloads where you may not have recorded the historical known times.
For instance, if we specify an offset of 2 days, then for Spring & Vine below, the `sales_actual` data point (date of 2020-10-01) will have a known time of 2020-10-03.
Structure your CSV as in Example 1, but without a column for known time:
entity_id,date,sales_actual,sales_estimate
merchant_001,2020-10-01,100000,50000
merchant_002,2020-05-31,,20000
Then run the following to specify an offset of 2 days:
python -m exabel_data_sdk.scripts.load_time_series_from_file \
--api-key="my_api_key" \
--filename="time_series.csv" \
--entity-type="my_namespace.merchant" \
--pit-offset=2
Example 3: live pipeline
For live production pipelines, we recommend setting the known time to be the import run time. Structure your CSV as in Example 2:
entity_id,date,sales_actual,sales_estimate
merchant_001,2020-10-01,100000,50000
merchant_002,2020-05-31,,20000
Then run the following:
python -m exabel_data_sdk.scripts.load_time_series_from_file \
--api-key="my_api_key" \
--filename="time_series.csv" \
--entity-type="my_namespace.merchant" \
--pit-current-time
The known time that is set will be the time on the Exabel system when the data is inserted into our database. (Your system time will not affect this.)
Best practices for live data pipelines
- Use `--pit-current-time`
- Import only new or updated data points - this will make the import run much more quickly, as there will be fewer data points to process.
If you are unable to filter out historical data points that are unchanged, this is ok - the Exabel platform accepts imports of full historical time series, even if many of the data points already exist.
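A daily production job might therefore export only the newest data points to a file and import them with `--pit-current-time` (a sketch; the file name `daily_increment.csv` is illustrative):
python -m exabel_data_sdk.scripts.load_time_series_from_file \
--api-key="my_api_key" \
--filename="daily_increment.csv" \
--entity-type="my_namespace.merchant" \
--create-missing-signals \
--pit-current-time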
Long-form vs wide-form
The examples above use data formatted in wide-form - multiple signals are arranged in columns, aligned to the entities and dates in each row. This is useful and more compact when your signal data is dense and all signals have the same frequency, such that every signal has a data point for each entity-and-date combination.
If your signals have differing frequencies (eg weekly and monthly signals), or if you want to import null values to override values with an earlier known-time, we recommend that you import data in long-form.
Long-form data is automatically detected
If your data is structured as per the example below, with entities in the first column, and `date`, `known_time` (optional), `signal` and `value` columns, the command-line script will process your file as long-form data.
The same data from example #1 above should be structured like this:
entity_id,date,known_time,signal,value
merchant_001,2020-10-01,2020-11-02,sales_actual,100000
merchant_001,2020-10-01,2020-11-02,sales_estimate,50000
merchant_002,2020-05-31,2020-06-02,sales_estimate,20000
- Column 1 must still be the entity column
- Column 2 must still be the `date` column
- The `signal` and `value` columns are required
- The `known_time` column shown here is optional
Importing null values
If your data contains null, unknown, missing or deleted values, these can be uploaded using long-form CSVs.
Differences in null handling: long-form vs wide-form
When handling long-form data, the Exabel command-line scripts accept null values, which may override existing values.
When handling wide-form data, the Exabel command-line scripts ignore null values in the CSV, and do not import these into Exabel.
Example long-form CSV with null values
In the example file below, there is a sales estimate on May 1, computed as the mean of two brokers' estimates. On May 2, however, no brokers are reporting a sales estimate for the company, so no valid sales estimate value exists on May 2. This is represented as a blank value in the CSV. On May 3, there is again a valid sales estimate, and a value reappears. See Point-in-time data for details about the use of known time in this scenario.
entity_id,date,known_time,signal,value
merchant_002,2020-05-31,2020-05-01,consensus_sales_estimate,20000
merchant_002,2020-05-31,2020-05-02,consensus_sales_estimate,
merchant_002,2020-05-31,2020-05-03,consensus_sales_estimate,19000
merchant_002,2020-05-31,2020-05-01,number_of_brokers,2
merchant_002,2020-05-31,2020-05-02,number_of_brokers,0
merchant_002,2020-05-31,2020-05-03,number_of_brokers,1
Importing time series on the global entity
It is also possible to connect time series to the "global" entity when uploading them. No entity column is necessary in the input file, but you must specify the argument `--entity-type="global"`. Example input file:
date;global_signal
2021-01-01;1
2021-01-02;2
Then run the following (note the `--sep` argument, needed because this example file uses a semicolon delimiter):
python -m exabel_data_sdk.scripts.load_time_series_from_file \
--api-key="my_api_key" \
--filename="time_series.csv" \
--entity-type="global" \
--sep=";" \
--pit-current-time
Best practice for global time series
Think carefully before loading time series to the global entity: are the raw signals truly global for your use case? If you find yourself loading multiple signals like `sales_europe`, `sales_americas`, `sales_africa` and so on, you should probably introduce a region entity type with regions (Europe, Americas, Africa) as entities, and load the time series to a `sales` signal on these entities instead. See data modelling for more examples.