A Helper for zipline data bundles
Data bundles in zipline feed trading strategies with price data during backtesting. Zipline ships with a few data bundles, including one that downloads price data from quandl’s wiki dataset. However, custom data bundles are often necessary, for example to fetch prices of assets that the existing bundles do not cover. Adding such bundles to zipline can be tricky and error-prone. It takes several steps, from reading the price data to preprocessing it and feeding it into the zipline internal database. The last step is essentially the same for all data bundles. It therefore makes sense to simplify the whole process by focusing only on fetching and preprocessing price data and leaving the rest to a generic ingester. I recently tried to implement such a generic ingester.
Generic Ingester
Zipline requires a custom data bundle to implement an ingest function properly. The function writes price data and symbol information to the zipline internal database. A generic ingest function implements the operations that are identical for every data bundle and allows customization of the operations that are specific to one bundle. It can ingest price data from a set of csv files in a specific format, or download it directly via the API of a price data provider.
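The split between shared and bundle-specific work can be sketched as a small functor. This is only an illustration of the idea, not the code from the repository: the class name GenericIngester, the fetch and write callables and the made-up prices are all assumptions, and the real ingest function of zipline takes more arguments than shown here.

```python
class GenericIngester:
    """Hypothetical sketch: shared steps live here, fetching is pluggable."""

    def __init__(self, exchange, fetch):
        self.exchange = exchange  # arbitrary label for the data source
        self.fetch = fetch        # bundle-specific part: symbol -> rows

    def __call__(self, symbols, write):
        # Shared part: walk over the symbols, assign ids, pass rows to the writer
        # and collect the symbol metadata.
        metadata = []
        for sid, symbol in enumerate(symbols):
            rows = self.fetch(symbol)
            write(sid, rows)
            metadata.append({"sid": sid, "symbol": symbol, "exchange": self.exchange})
        return metadata


# In-memory stand-ins for the data source and the database writer (made-up values).
fake_prices = {"SPY": [("2020-01-02", 1.0)], "AAPL": [("2020-01-02", 2.0)]}
written = {}

ingest = GenericIngester("DEMO", fetch=lambda symbol: fake_prices[symbol])
metadata = ingest(["SPY", "AAPL"], write=lambda sid, rows: written.update({sid: rows}))
```

Only the fetch callable changes from one bundle to the next; the loop and the bookkeeping stay the same.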
I have used the generic ingester to implement data bundles that
- read price data from a directory containing csv files downloaded from yahoo finance,
- directly download price data from yahoo finance, thanks to yahoofinancials,
- directly download price data from IEX cloud, thanks to iexfinance.
Looking into their implementation is a good way to understand how to use the generic ingester. Later in this post I also explain in detail how to define a new bundle with it.
Installation
There is a quick way to use or test the data bundles mentioned above with zipline. The first step is to get the source from GitHub:
git clone https://github.com/hhatefi/zipline_bundles
The repository comes with an installation script that adds the data bundles to the zipline framework. I assume that an environment with zipline installed and ready to use already exists. The installation is done by
cd zipline_bundles
python3 installer.py
The installer copies the following files:
- extension.py to $HOME/.zipline,
- ingester.py, yahoo.py and iex.py into the zipline.data.bundles package.
Note that the installer complains if python modules with the same name already exist in the destination directories. To force the installer to overwrite the existing modules, add -f.
In general, the installer copies all files listed in the variable src_ext into $HOME/.zipline and those listed in src_ing into the zipline.data.bundles package. When a new bundle is added, its modules can usually be appended to the src_ing list. The installation script can then be used to install the new bundle.
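The copy-unless-exists behaviour described above can be sketched in a few lines. This is not the code of installer.py; the function name install and its force argument are assumptions that mirror the -f option, and the demo runs in a throwaway directory.

```python
import shutil
import tempfile
from pathlib import Path


def install(files, dest, force=False):
    """Copy each file into dest, refusing to overwrite unless force is set."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    for src in map(Path, files):
        target = dest / src.name
        if target.exists() and not force:
            raise FileExistsError(f"{target} already exists; use force=True to overwrite")
        shutil.copy(src, target)


# Demo: install a hypothetical module twice, the second time without and with force.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp, "mybundle.py")
    src.write_text("# new bundle module\n")
    dest = Path(tmp, "bundles")

    install([src], dest)
    first_copy_ok = (dest / "mybundle.py").exists()

    try:
        install([src], dest)
        refused = False
    except FileExistsError:
        refused = True

    install([src], dest, force=True)  # overwrites without complaint
```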
Usage
The available bundles are listed by
zipline bundles
If the installation succeeded, the output includes the new bundles yahoo_csv, yahoo_direct and iex.
The yahoo_csv bundle takes data from csv files downloaded from yahoo finance. Each file contains price data of a single asset and must be named asset_name.csv. The bundle reads all the csv files located in the directory given by the environment variable YAHOO_CSVDIR:
YAHOO_CSVDIR=/path/to/csvdir zipline ingest -b yahoo_csv
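How a bundle might turn that directory into a list of assets can be sketched as follows. The helper symbols_from_csvdir is hypothetical and only illustrates the naming convention (the asset name is the filename without the csv extension); the actual lookup in the repository may differ.

```python
import tempfile
from pathlib import Path


def symbols_from_csvdir(environ, env_name="YAHOO_CSVDIR", default=None):
    """Map asset name -> csv path for every *.csv file in the configured directory."""
    csvdir = environ.get(env_name) or default
    if not csvdir or not Path(csvdir).is_dir():
        raise ValueError(f"{env_name} does not point to a directory: {csvdir!r}")
    # AAPL.csv -> 'AAPL': the asset name is the filename without its extension.
    return {path.stem: path for path in sorted(Path(csvdir).glob("*.csv"))}


# Demo on a throwaway directory; the file contents are irrelevant here.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("SPY.csv", "AAPL.csv", "notes.txt"):
        Path(tmp, name).write_text("")
    found = list(symbols_from_csvdir({"YAHOO_CSVDIR": tmp}))

print(found)  # ['AAPL', 'SPY']
```

In real use the first argument would be os.environ instead of a hand-made dictionary.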
The yahoo_direct bundle downloads price data directly from yahoo finance. It takes the asset names from the environment variable YAHOO_SYM_LST, which holds a comma separated list of asset names, for example:
YAHOO_SYM_LST=SPY,AAPL zipline ingest -b yahoo_direct
gets price data of assets SPY and AAPL. The start and end dates of price data ingestion can be set through the variables start_date and end_date, respectively. They are passed to the function get_downloader where the bundle is registered in $HOME/.zipline/extension.py. More information comes next.
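Reading a comma separated symbol list from the environment is straightforward. The helper below is a hypothetical sketch, not the function used in the repository; it additionally tolerates stray spaces and empty entries, which is my own assumption about what is convenient.

```python
def parse_symbol_list(environ, env_name="YAHOO_SYM_LST"):
    """Return the asset names stored as a comma separated list in environ[env_name]."""
    raw = environ.get(env_name, "")
    # Drop surrounding whitespace and ignore empty entries such as a trailing comma.
    return [name.strip() for name in raw.split(",") if name.strip()]


symbols = parse_symbol_list({"YAHOO_SYM_LST": "SPY, AAPL,"})
print(symbols)  # ['SPY', 'AAPL']
```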
The iex bundle downloads price data from IEX cloud. Its usage is similar to that of yahoo_direct. Fetching price data from IEX cloud, however, requires a valid API token, which is stored in the environment variable IEX_TOKEN and read by the iexfinance package. Moreover, the environment variable holding the asset names is called IEX_SYM_LST.
Defining new bundles
In zipline, a new bundle, which implements the ingest function, must be registered in the extension module extension.py, usually found in $HOME/.zipline/. Here I explain how to implement an ingest function and how to register it inside the extension module. I start with csv data bundles.
New CSV data bundle
This bundle reads csv files from a location, stores them in pandas.DataFrame objects, preprocesses them and feeds them into the zipline internal database. For csv files, processing the column names is most of the time the only thing that needs to be done. We also need to specify where the csv files are located. The registration of a csv data bundle looks as follows:
from zipline.data.bundles import register
from zipline.data.bundles.ingester import csv_ingester

register(
    'yahoo_csv',
    csv_ingester('YAHOO',
                 every_min_bar=False,  # the price is daily
                 csvdir_env='YAHOO_CSVDIR',
                 csvdir='/path/to/csv/dir',
                 index_column='Date',
                 column_mapper={'Open': 'open',
                                'High': 'high',
                                'Low': 'low',
                                'Close': 'close',
                                'Volume': 'volume',
                                'Adj Close': 'price',
                 },
    ),
    calendar_name='NYSE',
)
As mentioned before, the registration is done in $HOME/.zipline/extension.py. The ingest function is defined by creating an object of type csv_ingester, which is a functor. The parameters are as follows:
- 'YAHOO' is an arbitrary name for the exchange providing data.
- every_min_bar indicates the price frequency. When it is true, the prices in the csv files are expected to be reported per minute. Otherwise they are expected to be daily.
- csvdir_env is the name of the environment variable holding the csv directory. It can be set, for instance, while ingesting price data:
  YAHOO_CSVDIR=/path/to/csvdir zipline ingest -b yahoo_csv
  Zipline then searches for csv files inside /path/to/csvdir. The data bundle extracts the asset names from the filenames by stripping the csv extension. For example, it considers AAPL.csv to store price data of Apple stock.
- csvdir is the default csv directory, used when the environment variable is not set to a valid csv directory.
- index_column is the name of the column inside the csv file that stores date and time information. The bundle reads csv files into pandas.DataFrame objects with the index set to the given column.
- column_mapper is a dictionary used for renaming data columns to comply with the OHLCV format expected by zipline. As said earlier, price data are stored in dataframe objects, whose columns are identical to the corresponding columns in the csv files. Renaming is necessary if the csv files do not follow the OHLCV format.
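The effect of index_column and column_mapper can be reproduced with plain pandas. The snippet below is a sketch of what I expect the preprocessing to amount to, not the code of csv_ingester itself; the two csv rows are made-up values in the column layout of the registration above.

```python
import io

import pandas as pd

# Two made-up rows with the column names used in the registration example.
csv_text = """Date,Open,High,Low,Close,Adj Close,Volume
2020-01-02,10.0,11.0,9.5,10.5,10.4,1000
2020-01-03,10.5,12.0,10.0,11.5,11.4,1500
"""

column_mapper = {'Open': 'open',
                 'High': 'high',
                 'Low': 'low',
                 'Close': 'close',
                 'Volume': 'volume',
                 'Adj Close': 'price'}

# index_column='Date': use the date column as a parsed datetime index.
df = pd.read_csv(io.StringIO(csv_text), index_col='Date', parse_dates=True)

# column_mapper: rename the provider's column names to the OHLCV names.
df = df.rename(columns=column_mapper)

print(list(df.columns))  # ['open', 'high', 'low', 'close', 'price', 'volume']
```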
Two other parameters are passed to the register function: 'yahoo_csv' is the bundle name and calendar_name is the trading calendar on which the dates and times of the prices are based.
New direct data bundle
This type of bundle directly downloads price data via the API of a data provider. The downloader function is responsible for fetching price data and delivering it to the ingester. The ingester then feeds the data into the zipline internal database. Thus, the main step is to define the downloader. Like a csv ingester, a direct ingester needs to be registered before zipline can use it. As an example, I explain step by step how a bundle that fetches data from IEX cloud can be defined and registered.
First, a downloader function is required to download price data via the IEX cloud API. The ingester invokes the downloader with specific parameters, so the downloader must have a specific signature.
from pandas import Timestamp
from iexfinance.stocks import get_historical_data


def get_downloader(start_date, end_date):
    """returns a downloader closure for iex cloud

    :param start_date: the first day on which data are downloaded
    :param end_date: the last day on which data are downloaded
    :type start_date: str in format YYYY-MM-DD
    :type end_date: str in format YYYY-MM-DD
    """
    dt_start = Timestamp(start_date).date()
    dt_end = Timestamp(end_date).date()

    def downloader(symbol):
        """downloads symbol price data using iex cloud API

        :param symbol: the symbol name
        :type symbol: str
        """
        df = get_historical_data(symbol, dt_start, dt_end, output_format='pandas')
        return df

    return downloader
The downloader is generated by the function get_downloader as a closure. This function takes the date interval within which price data are downloaded via the arguments start_date and end_date. The downloader takes the symbol name as its argument and fetches price data by calling get_historical_data. get_historical_data, provided by the iexfinance package, handles the relevant REST API calls to fetch the data, then converts and returns it as a pandas.DataFrame. The return value must in addition comply with the OHLCV format. Assuming the above code block is stored as iex.py within the zipline.data.bundles package, the next step is to register a new data bundle, which uses the downloader to fetch price data from IEX cloud. The registration is done within extension.py.
from zipline.data.bundles import register
from zipline.data.bundles.ingester import direct_ingester
from zipline.data.bundles import iex

register('iex',  # bundle's name
         direct_ingester('IEX Cloud',
                         every_min_bar=False,
                         # the environment variable holding the comma separated list of asset names
                         symbol_list_env='IEX_SYM_LST',
                         downloader=iex.get_downloader(start_date='2015-01-01',
                                                       end_date='2020-01-01',
                         ),
         ),
         calendar_name='NYSE',
)
The bundle is called iex and, like yahoo_csv, uses the NYSE trading calendar. The ingest function is defined by creating an object of type direct_ingester, which is a functor. The parameters are as follows:
- 'IEX Cloud' is an arbitrary name for the exchange providing data.
- every_min_bar indicates the price frequency. When it is true, the prices are expected to be reported per minute. Otherwise they are daily prices.
- symbol_list_env is the name of the environment variable holding a comma separated list of asset names. It can be set, for instance, while ingesting price data:
  IEX_SYM_LST=SPY,AAPL,TWTR zipline ingest -b iex
  Zipline then downloads data for assets SPY, AAPL and TWTR.
- downloader is the downloader function, which in this case is given by iex.get_downloader, defined above. Price data are downloaded between the given start_date and end_date.
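Since the ingester only needs a callable that takes a symbol, a downloader with the same shape can be tried out without any network access. The function get_fake_downloader below is purely hypothetical and produces constant made-up prices for every weekday in the interval. To stay dependency-free it returns a list of dictionaries rather than the pandas.DataFrame a real downloader has to deliver.

```python
from datetime import date, timedelta


def get_fake_downloader(start_date, end_date):
    """Return a downloader closure that fabricates daily rows (for testing only)."""
    dt_start = date.fromisoformat(start_date)
    dt_end = date.fromisoformat(end_date)

    def downloader(symbol):
        # Same signature as a real downloader: one symbol name in, price rows out.
        rows = []
        day = dt_start
        while day <= dt_end:
            if day.weekday() < 5:  # skip Saturdays and Sundays
                rows.append({"date": day, "symbol": symbol,
                             "open": 1.0, "high": 1.0, "low": 1.0,
                             "close": 1.0, "volume": 0})
            day += timedelta(days=1)
        return rows

    return downloader


downloader = get_fake_downloader("2020-01-01", "2020-01-07")
rows = downloader("SPY")
print(len(rows))  # 5
```

The closure captures the date interval once, so the ingester can call it with nothing but the symbol name, exactly as with iex.get_downloader.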
Conclusion
This helper aims to simplify the process of defining new data bundles, whether they read data from csv files or download it directly over the network. New data bundles can be added by customizing the generic ingester. The user can focus on data retrieval and filtering and leave the other tasks to the helper module.