Datasets
Defining Data Models
Part 1/ Set up SDK
Follow these steps to create a repository for your dataset:
- Open the Lean.DataSource.SDK repository and click .
- On the Create a new repository from Lean.DataSource.SDK page, set the repository name to Lean.DataSource.<vendorNameDatasetName> (for example, Lean.DataSource.XYZAirlineTicketSales).
- Click .
- Clone the Lean.DataSource.<vendorNameDatasetName> repository.
- If you're on a Linux terminal, in your Lean.DataSource.<vendorNameDatasetName> directory, change the access permissions of the bash script.
- In your Lean.DataSource.<vendorNameDatasetName> directory, run the renameDataset.sh bash script.
Start with the SDK repository instead of existing data source implementations because we periodically update the SDK repository.
If your dataset contains multiple series, use <vendorName> instead of <vendorNameDatasetName>. For instance, the Federal Reserve Economic Data (FRED) dataset repository has the name Lean.DataSource.FRED because it has many different series.
$ git clone https://github.com/username/Lean.DataSource.<vendorNameDatasetName>.git
$ chmod +x ./renameDataset.sh
$ ./renameDataset.sh
The bash script replaces some placeholder text in the Lean.DataSource.<vendorNameDatasetName> directory and renames some files according to your dataset's <vendorNameDatasetName>.
Part 2/ Create Data Models
The input to your model should be one or many CSV files that are in chronological order.
1997-01-01,905.2,941.4,905.2,939.55,38948210,978.21
1997-01-02,941.95,944,925.05,927.05,49118380,1150.42
1997-01-03,924.3,932.6,919.55,931.65,35263845,866.74
...
2014-07-24,7796.25,7835.65,7771.65,7830.6,117608370,6271.45
2014-07-25,7828.2,7840.95,7748.6,7790.45,153936037,7827.61
2014-07-28,7792.9,7799.9,7722.65,7748.7,116534670,6107.78
If you don't already have these CSV files, you'll create them later during the Rendering Data part of this tutorial series. For this part of the contribution process, consider using a "toy example" file to establish the format and requirements.
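A toy file is only useful if it obeys the same ordering rule as the real data. The following is a minimal sketch, assuming the first column is an ISO date as in the sample rows above; is_chronological is a hypothetical helper for illustration, not part of the SDK.

```python
from datetime import date

def is_chronological(lines):
    """Return True if the first CSV column (an ISO date) never decreases."""
    dates = [date.fromisoformat(line.split(",")[0]) for line in lines if line.strip()]
    return all(earlier <= later for earlier, later in zip(dates, dates[1:]))

# Toy rows copied from the sample file above.
toy_rows = [
    "1997-01-01,905.2,941.4,905.2,939.55,38948210,978.21",
    "1997-01-02,941.95,944,925.05,927.05,49118380,1150.42",
    "1997-01-03,924.3,932.6,919.55,931.65,35263845,866.74",
]
print(is_chronological(toy_rows))
```

Running a check like this on each toy file before you write the data source class can catch ordering mistakes early.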
Follow these steps to define the data source class:
- Open the Lean.DataSource.<vendorNameDatasetName> / <vendorNameDatasetName>.cs file.
- Follow these steps to define the properties of your dataset:
  - Duplicate lines 32-36 for as many properties as there are in your dataset.
  - Rename the SomeCustomProperty properties to the names of your dataset properties (for example, Destination).
  - If your dataset is a streaming dataset like the Benzinga News Feed, change the argument that is passed to the ProtoMember members so that they start at 10 and increment by one for each additional property in your dataset.
  - If your dataset isn't a streaming dataset, delete the ProtoMember property decorators.
  - Replace the “Some custom data property” comments with a description of each property in your dataset.
- If your dataset contains multiple series, like the FRED dataset, create a helper class file in Lean.DataSource.<vendorNameDatasetName> directory to map the series name to the series code. For a full example, see the LIBOR.cs file in the Lean.DataSource.FRED repository. The helper class makes it easier for members to subscribe to the series in your dataset because they don't need to know the series code. For instance, you can subscribe to the 1-Week London Interbank Offered Rate (LIBOR) based on U.S. Dollars with the following code snippet:
- Define the GetSource method to point to the path of your dataset file(s).
- Define the Reader (reader in Python) method to return instances of your dataset class.
- Define the DataTimeZone method.
- Define the SupportedResolutions method.
- Define the DefaultResolution method.
- Define the IsSparseData method.
- Define the RequiresMapping method.
- Define the Clone method.
- Define the ToString method.
AddData<Fred>(Fred.LIBOR.OneWeekBasedOnUSD);
// Instead of
// AddData<Fred>("USD1WKD156N");

self.add_data(Fred, Fred.LIBOR.one_week_based_on_usd)
# Instead of
# self.add_data(Fred, "USD1WKD156N")
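The idea behind the series helper class can be sketched in plain Python. This is only an illustration of the name-to-code mapping: FredSketch and LIBORSketch are hypothetical stand-ins, not the real classes in the Lean.DataSource.FRED repository, and the only value taken from the text is the series code USD1WKD156N.

```python
class LIBORSketch:
    # Readable attribute name mapped to the vendor's series code.
    one_week_based_on_usd = "USD1WKD156N"


class FredSketch:
    # Grouping helpers under the data class keeps lookups discoverable.
    LIBOR = LIBORSketch


# A subscriber refers to the readable name and never types the raw code.
series_code = FredSketch.LIBOR.one_week_based_on_usd
print(series_code)
```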
If your dataset is organized across multiple CSV files, use the config.Symbol.Value string to build the file path. config.Symbol.Value is the string value of the argument you pass to the AddData method when you subscribe to the dataset. An example output file path is / output / alternative / xyzairline / ticketsales / dal.csv.
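The path construction can be sketched as follows. This is a plain-Python illustration rather than the Lean GetSource signature: build_source_path is a hypothetical helper, the folder names come from the example path above, and lower-casing the symbol value is an assumption inferred from the dal.csv file name.

```python
import posixpath

def build_source_path(symbol_value, root="output"):
    """Build the per-symbol CSV path from the subscribed symbol string."""
    # Assumption: file names are the lower-cased symbol value, as in dal.csv.
    file_name = f"{symbol_value.lower()}.csv"
    return posixpath.join(root, "alternative", "xyzairline", "ticketsales", file_name)

print(build_source_path("DAL"))
```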
Set Symbol = config.Symbol and set EndTime (end_time in Python) to the time that the datapoint first became available for consumption.
Your data class inherits from the BaseData class, which has Value and Time (time in Python) properties. Set the Value property to one of the factors in your dataset. If you don't set the Time property, its default value is the value of EndTime (end_time in Python). For more information about the Time and EndTime properties, see Periods.
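To make the relationship between these properties concrete, here is a small plain-Python sketch of what a reader does with one CSV line. It does not use the Lean API: TicketSalesPoint and read_line are hypothetical, the one-day availability delay is an assumption, and choosing the fifth column as the value is arbitrary.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class TicketSalesPoint:
    symbol: str
    end_time: datetime          # when the point became available for consumption
    value: float                # one factor chosen from the row
    time: Optional[datetime] = None

    def __post_init__(self):
        # Mirrors the rule in the text: an unset time falls back to end_time.
        if self.time is None:
            self.time = self.end_time

def read_line(symbol, line, availability_delay=timedelta(days=1)):
    """Parse one CSV row into a data point (assumed one-day publication delay)."""
    fields = line.split(",")
    observed = datetime.strptime(fields[0], "%Y-%m-%d")
    return TicketSalesPoint(
        symbol=symbol,
        end_time=observed + availability_delay,
        value=float(fields[4]),
    )

point = read_line("DAL", "1997-01-01,905.2,941.4,905.2,939.55,38948210,978.21")
print(point.end_time, point.time, point.value)
```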
public class VendorNameDatasetName : BaseData
{
    public override DateTimeZone DataTimeZone()
    {
        return DateTimeZone.Utc;
    }
}
If you import using QuantConnect, the TimeZones class provides helper attributes to create DateTimeZone objects. For example, you can use TimeZones.Utc or TimeZones.NewYork. For more information about time zones, see Time Zones.
public class VendorNameDatasetName : BaseData
{
    public override List<Resolution> SupportedResolutions()
    {
        return DailyResolution;
    }
}
The Resolution enumeration has the following members: Tick, Second, Minute, Hour, and Daily.
If a member doesn't specify a resolution when they subscribe to your dataset, Lean uses the DefaultResolution.
public class VendorNameDatasetName : BaseData
{
    public override Resolution DefaultResolution()
    {
        return Resolution.Daily;
    }
}
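The fallback behavior described above can be pictured with a short plain-Python sketch. The Resolution enum here is a local stand-in for Lean's enumeration, and effective_resolution is a hypothetical helper that only illustrates the fallback, not how Lean implements it.

```python
from enum import Enum

class Resolution(Enum):
    # Local stand-in for Lean's Resolution enumeration.
    TICK = "Tick"
    SECOND = "Second"
    MINUTE = "Minute"
    HOUR = "Hour"
    DAILY = "Daily"

# Matches the DefaultResolution override shown above.
DEFAULT_RESOLUTION = Resolution.DAILY

def effective_resolution(requested=None):
    """Use the subscriber's resolution if given, else the dataset default."""
    return DEFAULT_RESOLUTION if requested is None else requested

print(effective_resolution().value)
```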
If your dataset is not tick resolution and your dataset is missing data for at least one sample, it's sparse. If your dataset is sparse, we disable logging for missing files.
public class VendorNameDatasetName : BaseData
{
    public override bool IsSparseData()
    {
        return true;
    }
}
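If you are unsure whether your dataset counts as sparse, a quick check over the sample dates can help. The sketch below is a hypothetical helper that assumes a sample is expected at every step (one calendar day by default) between the first and last date; adjust that assumption if your dataset legitimately skips weekends or holidays.

```python
from datetime import date, timedelta

def is_sparse(sample_dates, step=timedelta(days=1)):
    """Return True if any expected sample between the first and last date is missing."""
    present = set(sample_dates)
    if not present:
        return False
    current, last = min(present), max(present)
    while current <= last:
        if current not in present:
            return True
        current += step
    return False

print(is_sparse([date(2020, 3, 18), date(2020, 3, 20)]))
```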
public class VendorNameDatasetName : BaseData
{
    public override bool RequiresMapping()
    {
        return true;
    }
}

public class VendorNameDatasetName : BaseData
{
    public override BaseData Clone()
    {
        return new VendorNameDatasetName
        {
            Symbol = Symbol,
            Time = Time,
            EndTime = EndTime,
            SomeCustomProperty = SomeCustomProperty,
        };
    }
}

public class VendorNameDatasetName : BaseData
{
    public override string ToString()
    {
        return $"{Symbol} - {SomeCustomProperty}";
    }
}
Part 3/ Create Universe Models
If your dataset doesn't provide universe data, follow these steps:
- Delete the Lean.DataSource.<vendorNameDatasetName> / <vendorNameDatasetName>Universe.cs file.
- Delete the Lean.DataSource.<vendorNameDatasetName> / <vendorNameDatasetName>UniverseSelectionAlgorithm.* files.
- In the Lean.DataSource.<vendorNameDatasetName> / tests / Tests.csproj file, delete the code on line 8 that compiles the universe selection algorithms.
- Skip the rest of this page.
The input to your model should be many CSV files where the first column is the security identifier and the second column is the point-in-time ticker.
A R735QTJ8XC9X,A,17.19,109700,1885743,False,0.9904858,1
AA R735QTJ8XC9X,AA,71.25,513400,36579750,False,0.3992678,0.750075
AAB R735QTJ8XC9X,AAB,16.38,5000,81900,False,0.9902758,1
...
ZSEV R735QTJ8XC9X,ZSEV,10.5,800,8400,False,0.8981684,1
ZTR R735QTJ8XC9X,ZTR,9.56,102300,977988,False,0.0803037,3.97015016
ZVX R735QTJ8XC9X,ZVX,10,15600,156000,False,1,0.666667
Follow these steps to define the data source class:
- Open the Lean.DataSource.<vendorNameDatasetName> / <vendorNameDatasetName>Universe.cs file.
- Follow these steps to define the properties of your dataset:
  - Duplicate lines 33-36 or 38-41 (depending on the data type) for as many properties as there are in your dataset.
  - Rename the SomeCustomProperty/SomeNumericProperty properties to the names of your dataset properties (for example, Destination/FlightPassengerCount).
  - Replace the “Some custom data property” comments with a description of each property in your dataset.
- Define the GetSource method to point to the path of your dataset file(s).
- Define the Reader (reader in Python) method to return instances of your universe class.
- Define the DataTimeZone method.
- Define the SupportedResolutions method.
- Define the DefaultResolution method.
- Define the IsSparseData method.
- Define the RequiresMapping method.
- Define the Clone method.
- Define the ToString method.
Use the date parameter as the file name to get the DateTime of data being requested. Example output file paths are / output / alternative / xyzairline / ticketsales / universe / 20200320.csv for daily data and / output / alternative / xyzairline / ticketsales / universe / 2020032000.csv for hourly data.
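The two file-name patterns in these example paths can be sketched with standard date formatting. universe_file_name is a hypothetical helper for illustration; the yyyyMMdd and yyyyMMddHH patterns are read off the example file names above.

```python
from datetime import datetime

def universe_file_name(date, hourly=False):
    """Format the requested date as the universe file name."""
    # Daily files use yyyyMMdd; hourly files append the two-digit hour.
    pattern = "%Y%m%d%H" if hourly else "%Y%m%d"
    return f"{date.strftime(pattern)}.csv"

print(universe_file_name(datetime(2020, 3, 20)))
print(universe_file_name(datetime(2020, 3, 20, 0), hourly=True))
```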
The first column in your data file must be the security identifier and the second column must be the point-in-time ticker. With this configuration, use new Symbol(SecurityIdentifier.Parse(csv[0]), csv[1]) to create the security Symbol.
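As a rough plain-Python illustration of that column layout, the sketch below splits one universe row into its identifier, ticker, and remaining fields. split_universe_row is a hypothetical helper; it does not construct a Lean Symbol, it only shows which columns feed that call.

```python
def split_universe_row(line):
    """Split a universe CSV row into (security identifier, ticker, other fields)."""
    csv = line.split(",")
    # csv[0] is the security identifier and csv[1] is the point-in-time ticker.
    return csv[0], csv[1], csv[2:]

security_id, ticker, rest = split_universe_row(
    "A R735QTJ8XC9X,A,17.19,109700,1885743,False,0.9904858,1"
)
print(security_id, ticker, len(rest))
```

Note that the identifier itself contains a space, so split on commas only.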
The date in your data file must be the date that the data point is available for consumption. With this configuration, set the Time (time in Python) to date - Period.
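The arithmetic is simple but easy to get backwards, so here is a small sketch. universe_times is a hypothetical helper, the one-day period is an assumption for daily data, and treating the file date as the end time is inferred from the statement that the date marks when the data becomes available.

```python
from datetime import datetime, timedelta

def universe_times(date, period=timedelta(days=1)):
    """Return (time, end_time) for a universe data point dated `date`."""
    end_time = date          # assumed: the file date is when the data is available
    time = date - period     # start of the period the data point covers
    return time, end_time

time, end_time = universe_times(datetime(2020, 3, 20))
print(time, end_time)
```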
public class VendorNameDatasetNameUniverse : BaseData
{
    public override DateTimeZone DataTimeZone()
    {
        return DateTimeZone.Utc;
    }
}
If you import using QuantConnect, the TimeZones class provides helper attributes to create DateTimeZone objects. For example, you can use TimeZones.Utc or TimeZones.NewYork. For more information about time zones, see Time Zones.
public class VendorNameDatasetNameUniverse : BaseData
{
    public override List<Resolution> SupportedResolutions()
    {
        return DailyResolution;
    }
}
Universe data must have hour or daily resolution.
The Resolution enumeration has the following members: Tick, Second, Minute, Hour, and Daily.
If a member doesn't specify a resolution when they subscribe to your dataset, Lean uses the DefaultResolution.
public class VendorNameDatasetNameUniverse : BaseData
{
    public override Resolution DefaultResolution()
    {
        return Resolution.Daily;
    }
}
If your dataset is not tick resolution and your dataset is missing data for at least one sample, it's sparse. If your dataset is sparse, we disable logging for missing files.
public class VendorNameDatasetNameUniverse : BaseData
{
    public override bool IsSparseData()
    {
        return true;
    }
}

public class VendorNameDatasetNameUniverse : BaseData
{
    public override bool RequiresMapping()
    {
        return true;
    }
}
public class VendorNameDatasetNameUniverse : BaseData
{
    public override BaseData Clone()
    {
        // Return the universe type itself, not the non-universe data class.
        return new VendorNameDatasetNameUniverse
        {
            Symbol = Symbol,
            Time = Time,
            EndTime = EndTime,
            SomeCustomProperty = SomeCustomProperty,
        };
    }
}
public class VendorNameDatasetNameUniverse : BaseData
{
    public override string ToString()
    {
        return $"{Symbol} - {SomeCustomProperty}";
    }
}