Preparing Data

Transforming your data into the right format

For most customers, data preparation involves the following:

  1. Transformation & Aggregation: getting your data into time series, aggregated at the right levels

  2. Naming conventions: naming your entities, relationships & signals for easy accessibility

  3. Time alignment: aligning data points to the correct times

  4. Point-in-time data: specifying the correct known times to enable accurate point-in-time analysis

Transformation & Aggregation

The Exabel Platform supports time series data at daily or lower resolution (weekly, monthly, etc.). Unstructured or observation-level data (eg individual transactions) must be aggregated into time series before import.
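For example, observation-level transactions can be rolled up to a daily series before import. A minimal pandas sketch, with hypothetical column names:

```python
import pandas as pd

# Hypothetical observation-level data: one row per card transaction.
transactions = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2021-01-01 09:13", "2021-01-01 17:45", "2021-01-02 11:02"], utc=True
    ),
    "amount": [12.50, 99.00, 7.25],
})

# Aggregate to a daily time series of total card spend.
daily_spend = transactions.set_index("timestamp")["amount"].resample("D").sum()
```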

Exabel provides a powerful signal DSL for transforming and aggregating time series data in the platform, so it is preferable to import "raw" time series for greater flexibility when analysing the data.

πŸ‘

Best practice: transformation

Import time series in a "raw" form. Transformations such as YoY change, moving averages, etc. can be done in Exabel with the signal DSL.

Example: import raw data such as daily card spend, sentiment scores and number of job postings. There is no need to transform such data further before import.
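To make this concrete, here is what such transformations look like if you were computing them yourself. This pandas sketch is purely for illustration of what you can leave out of your import, since the signal DSL computes the equivalents in-platform:

```python
import numpy as np
import pandas as pd

# A hypothetical daily card spend series.
dates = pd.date_range("2020-01-01", "2021-12-31", freq="D", tz="UTC")
spend = pd.Series(np.random.default_rng(0).uniform(80, 120, len(dates)), index=dates)

# These derived series need not be precomputed before import:
yoy_change = spend.pct_change(365)     # year-over-year change
moving_avg = spend.rolling(28).mean()  # 28-day moving average
```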

πŸ‘

Best practice: temporal aggregation

Import time series with daily resolution, or the highest possible resolution you have. This allows for more flexibility in analysing the data. Exabel's signal DSL has many functions to align and aggregate data to company-specific fiscal periods, when needed downstream.

Example: import daily card spend data, and aggregate to fiscal quarters in Exabel.

πŸ‘

Best practice: entity-level aggregation

Aggregate data to the lowest entity level that you might possibly be interested in analysing, but no lower. Exabel's signal DSL allows you to aggregate time series across multiple entities into the parent entity level.

Example: you have data on Apple's sales by product (iPhone, iPad, etc.), with store-level data for each product. If you are not interested in store-level analysis, aggregate store-level data to product-level. Later in the Exabel platform, you can still aggregate product-level data up to the parent entity (Apple).

Optional: if you are aggregating data for entities below a company, also do a separate aggregation to company level, as company-level series are commonly used in prediction models, portfolio strategies and dashboards.
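A minimal pandas sketch of this aggregation, using hypothetical store-level sales data:

```python
import pandas as pd

# Hypothetical store-level sales: one row per (date, product, store).
store_sales = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-31"] * 4, utc=True),
    "product": ["iPhone", "iPhone", "iPad", "iPad"],
    "store": ["NYC", "LA", "NYC", "LA"],
    "sales": [100.0, 80.0, 40.0, 30.0],
})

# Aggregate away the store level: one time series per product.
product_sales = store_sales.groupby(["product", "date"])["sales"].sum()

# Optional separate company-level aggregation, for use in prediction
# models, portfolio strategies and dashboards.
company_sales = store_sales.groupby("date")["sales"].sum()
```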

Naming conventions

You can provide names for the data objects you import (entities, relationships, signals).

πŸ‘

Best practice: naming

Entities: entity names are case-sensitive, so make sure that variations are normalised when they refer to the same entity. This can be achieved by, for example, lowercasing all variations.

Example: if your data contains brand names with variations like Adidas and ADIDAS, you could lowercase both to adidas.
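A minimal pandas sketch of this normalisation, using illustrative values:

```python
import pandas as pd

brands = pd.Series(["Adidas", "ADIDAS", " adidas"])
# Strip whitespace and lowercase so all variations map to the same entity.
normalised = brands.str.strip().str.lower()  # all become "adidas"
```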

Signals: use a name that combines "key aspects" for immediate recognition by a user. For example, a good signal name like acmedata_unique_customers_index_m_afp includes:

  • Source/data set (eg acmedata)
  • Metric (eg unique_customers_index)
  • Frequency (eg d, m, q, a)
  • Alignment (eg afp - actual fiscal period, rd - reporting date)
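A small sketch of assembling such a name from its parts (the scheme is a recommendation, not a requirement):

```python
# Key aspects of the signal, combined into one recognisable name.
source, metric, freq, alignment = "acmedata", "unique_customers_index", "m", "afp"
signal_name = f"{source}_{metric}_{freq}_{alignment}"
# -> "acmedata_unique_customers_index_m_afp"
```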

Time alignment

In Exabel, each data point in a time series is defined by a single timestamp.

However, if your time series data is for periods that are longer than a day, you should set the timestamp to the period end. For example, if you have a data point for January 2021, set the timestamp as 2021-01-31T00:00:00Z.

πŸ‘

Best practice: time alignment

Time: set timestamps as period-end dates. Always provide timestamps as ISO-8601 datetimes, normalised to midnight UTC (eg 2021-01-01T00:00:00Z).
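A minimal pandas sketch of converting monthly values to period-end timestamps at midnight UTC:

```python
import pandas as pd

# Monthly values keyed by period, converted to period-end timestamps
# normalised to midnight UTC (eg 2021-01-31T00:00:00Z).
monthly = pd.Series([1.0, 2.0], index=pd.PeriodIndex(["2021-01", "2021-02"], freq="M"))
monthly.index = (
    monthly.index.to_timestamp(how="end")  # end of each period
    .normalize()                           # midnight
    .tz_localize("UTC")                    # UTC
)
```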

Point-in-time data

All time series on the Exabel platform are stored with point-in-time information. For each data point, we store the value, the actual date the value refers to, and the known time, which is the date at which the value became known.

When uploading data, you must specify what this known time is for each time series data point. This is relevant for point-in-time analysis, as Exabel will use this to avoid look-ahead bias when using your data in features such as prediction models and alpha tests.

Example: for a data point of card spend on 1 January 2021, you may specify a known time of 3 January 2021 if the value only became known 2 days later.

In addition, Exabel supports uploading time series with multiple data points for the same date, but with different known times. This is relevant for data with initial estimates that get revised over time.

Example: your card spend data point for 1 January 2021 may have an initial estimated value on 3 January 2021, and a final value on 5 January 2021. You can, and should, import both data points, each with the date 1 January 2021 but with different known times.
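These two revisions can be represented as two rows with the same date but different known times. A sketch with illustrative column names (check the SDK documentation for the exact CSV format it expects):

```python
import pandas as pd

# Same date, two known times: an initial estimate and a later final value.
# Column names are illustrative, not the SDK's required schema.
revisions = pd.DataFrame({
    "date":       pd.to_datetime(["2021-01-01", "2021-01-01"], utc=True),
    "known_time": pd.to_datetime(["2021-01-03", "2021-01-05"], utc=True),
    "card_spend": [102.4, 101.9],  # initial estimate, then final value
})
revisions.to_csv("card_spend_revisions.csv", index=False)
```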

πŸ‘

Best practice: point-in-time

Use the Exabel SDK to easily specify known times.

Always specify the known times while importing time series data.

Historical imports

  • If the data contains multiple known times for a given actual date, specify the known time for each data point explicitly (as a separate column in the CSV file, if using the SDK). This applies to data with trickle-in effects or restatements.
  • If the data is usually available with a fixed lag, specify a point-in-time offset (eg 1 day or 6 days) to set all known times in bulk to the actual date plus the given delay. If using the SDK, this is done with a command-line parameter; a sketch of computing such an offset yourself follows this list.
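If you would rather materialise the known times yourself than rely on an SDK parameter, a fixed lag can be added as a column. A minimal pandas sketch, assuming a 2-day publication delay and illustrative column names:

```python
import pandas as pd

daily = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=3, freq="D", tz="UTC"),
    "card_spend": [100.0, 98.5, 103.2],
})

# Fixed 2-day publication lag: known time = actual date + 2 days.
daily["known_time"] = daily["date"] + pd.Timedelta(days=2)
```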

Live / ongoing imports

Set the known time to be the time at which the data is being loaded.

🚧

Default known time is the time of data import

If you use the Exabel Data API or SDK and do not specify a known time, it is set by default to be the time of the data import. For historical data imports, this is likely incorrect!