A Helper for zipline data bundles

Data bundles in zipline feed trading strategies with price data during backtesting. Zipline ships with a few data bundles, including one that downloads price data from quandl’s wiki dataset. However, custom data bundles are often necessary, for example to fetch prices of assets that the existing bundles do not cover. Adding such bundles to zipline can be tricky and error-prone. It takes several steps, from reading price data to preprocessing it and feeding it into zipline's internal database. The last of these steps is essentially the same for all data bundles. It therefore makes sense to simplify the whole process by focusing only on fetching and preprocessing price data and leaving the rest to a generic ingester. I recently tried to implement such a generic ingester.

Generic Ingester

Zipline requires custom data bundles to implement an ingest function. The function essentially writes price data and symbol information to zipline's internal database. A generic ingest function implements the operations that are identical for every data bundle and, at the same time, allows the operations that are specific to a data bundle to be customized. It can ingest price data from a set of csv files with a specific format, or download the data directly via the API of a price data provider.
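The split between shared and bundle-specific work can be pictured as a callable object with a few overridable hooks. The following is only a minimal sketch of that idea, not the code in the repository: the class names, the two hook methods and the `write_bars` callback are hypothetical, and zipline's real ingest function receives a longer argument list (writers, calendar, sessions and so on).

```python
class IngesterSketch:
    """Hypothetical generic ingester: a shared loop plus two bundle-specific hooks."""

    def __init__(self, exchange_name, every_min_bar=False):
        self.exchange_name = exchange_name
        self.every_min_bar = every_min_bar

    def list_symbols(self, environ):
        """Bundle-specific: decide which assets to ingest."""
        raise NotImplementedError

    def load_prices(self, symbol):
        """Bundle-specific: fetch and preprocess price data of one asset."""
        raise NotImplementedError

    def __call__(self, environ, write_bars):
        """Shared part: give each symbol an id and hand its data to the writer."""
        frequency = "minute" if self.every_min_bar else "daily"
        symbols = list(self.list_symbols(environ))
        for sid, symbol in enumerate(symbols):
            write_bars(frequency, sid, symbol, self.load_prices(symbol))
        return symbols


class ConstantPriceIngester(IngesterSketch):
    """Toy bundle: symbols come from an environment mapping, prices are made up."""

    def list_symbols(self, environ):
        return [s for s in environ.get("DEMO_SYM_LST", "").split(",") if s]

    def load_prices(self, symbol):
        return [{"open": 1.0, "high": 1.0, "low": 1.0, "close": 1.0, "volume": 0}]


# Record what the shared loop would hand over to a writer.
written = []
ingest = ConstantPriceIngester("DEMO")
ingest({"DEMO_SYM_LST": "SPY,AAPL"},
       lambda freq, sid, symbol, rows: written.append((freq, sid, symbol, len(rows))))
print(written)
```

Only the two hooks differ from bundle to bundle in this sketch; the loop in `__call__` stays the same.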

I have used the generic ingester to implement data bundles that

read price data from a directory containing csv files downloaded from yahoo finance,
download price data directly from yahoo finance, thanks to yahoofinancials,
download price data directly from IEX cloud, thanks to iexfinance.

Looking at their implementation is a good way to understand how to use the generic ingester. Later in this post, I also explain in detail how to define a new bundle with it.

Installation

There is a quick way to use or test the already mentioned data bundles with zipline. The first step is to get the source from github:

git clone https://github.com/hhatefi/zipline_bundles

The repository comes with an installation script, which adds the data bundles to the zipline framework. I assume that an environment with zipline installed is already available and ready to use. The installation is done by

cd zipline_bundles
python3 installer.py

The installer copies the following files:

extension.py to $HOME/.zipline,
ingester.py, yahoo.py and iex.py into zipline.data.bundles package.

Note that the installer complains if python modules with the same names already exist in the destination directories. To force the installer to overwrite the existing modules, add -f.

In general, the installer copies all files listed in the variable src_ext into $HOME/.zipline and those listed in src_ing into the zipline.data.bundles package. When a new bundle is added, its modules can usually be appended to the src_ing list, and the installation script can then be used to install the new bundle.

Usage

The available bundles are listed by

zipline bundles

If the installation was successful, the list includes the new bundles yahoo_csv, yahoo_direct and iex.

The yahoo_csv bundle takes data from csv files downloaded from yahoo finance. Each file contains price data of a single asset and must be named asset_name.csv. The bundle reads all csv files located in the directory given by the environment variable YAHOO_CSVDIR:

YAHOO_CSVDIR=/path/to/csvdir zipline ingest -b yahoo_csv
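How a bundle might turn that environment variable into a list of assets can be sketched as follows. This is an illustration under assumptions, not the repository's code: the helper names `resolve_csvdir` and `symbols_in` are hypothetical, and the fallback to a default directory mirrors the csvdir parameter described later in this post.

```python
import os
import tempfile
from pathlib import Path


def resolve_csvdir(environ, env_name="YAHOO_CSVDIR", default=None):
    """Return the directory named by the environment variable, else the default."""
    candidate = environ.get(env_name)
    if candidate and os.path.isdir(candidate):
        return Path(candidate)
    if default and os.path.isdir(default):
        return Path(default)
    raise ValueError(f"no valid csv directory: set {env_name} or pass a default")


def symbols_in(csvdir):
    """Map each asset name (file name without .csv) to its file."""
    return {p.stem: p for p in sorted(Path(csvdir).glob("*.csv"))}


# Demonstration on a throwaway directory holding two empty csv files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("AAPL.csv", "SPY.csv", "notes.txt"):
        Path(tmp, name).touch()
    found = symbols_in(resolve_csvdir({"YAHOO_CSVDIR": tmp}))
    symbols = sorted(found)

print(symbols)
```

Files without the csv extension, such as notes.txt here, are simply ignored.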

yahoo_direct directly downloads price data from yahoo finance. The bundle extracts asset names from environment variable YAHOO_SYM_LST, which holds a comma separated list of asset names, for example:

YAHOO_SYM_LST=SPY,AAPL zipline ingest -b yahoo_direct

gets price data of assets SPY and AAPL. The start and end dates of the ingested price data are set through the arguments start_date and end_date, respectively. They are passed to the function get_downloader at the point where the bundle is registered in $HOME/.zipline/extension.py. More details follow below.
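Parsing such a comma separated list is straightforward. The sketch below shows one way to do it; the function name `symbols_from_env` is hypothetical, and trimming whitespace and dropping blanks and duplicates are assumptions of mine rather than documented behaviour of the bundle.

```python
import os


def symbols_from_env(env_name, environ=os.environ):
    """Split a comma separated environment variable into clean symbol names."""
    raw = environ.get(env_name, "")
    symbols = []
    for item in raw.split(","):
        symbol = item.strip()
        if symbol and symbol not in symbols:  # skip blanks and duplicates
            symbols.append(symbol)
    return symbols


# A plain dict stands in for os.environ to keep the demo side-effect free.
demo_env = {"YAHOO_SYM_LST": "SPY, AAPL,,SPY"}
parsed = symbols_from_env("YAHOO_SYM_LST", demo_env)
print(parsed)
```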

iex downloads price data from IEX cloud. Its usage is fairly similar to that of yahoo_direct. Fetching price data from IEX cloud, however, requires a valid API token, which is stored in the environment variable IEX_TOKEN and read by the iexfinance package. Moreover, the environment variable storing asset names is called IEX_SYM_LST.

Defining new bundles

In zipline a new bundle, which implements the ingest function, must be registered in the extension module extension.py, usually found in $HOME/.zipline/. Here, I explain how to implement an ingest function and how to register it inside the extension module. I start with csv data bundles.

New CSV data bundle

This bundle reads csv files from a location, stores them in pandas.DataFrame objects, preprocesses them and feeds them into zipline's internal database. For csv files, adjusting the column names is most of the time the only preprocessing that needs to be done. We also need to specify where the csv files are located. The registration of a csv data bundle looks as follows:

from zipline.data.bundles import register
from zipline.data.bundles.ingester import csv_ingester
register(
    'yahoo_csv',
    csv_ingester(
        'YAHOO',
        every_min_bar=False,  # the price is daily
        csvdir_env='YAHOO_CSVDIR',
        csvdir='/path/to/csv/dir',
        index_column='Date',
        column_mapper={
            'Open': 'open',
            'High': 'high',
            'Low': 'low',
            'Close': 'close',
            'Volume': 'volume',
            'Adj Close': 'price',
        },
    ),
    calendar_name='NYSE',
)

As mentioned before, the registration is done in $HOME/.zipline/extension.py. The ingest function is defined by creating an object of type csv_ingester, which is a functor. The parameters are as follows:

'YAHOO' is an arbitrary name for the exchange providing data.
every_min_bar indicates the price frequency. When it is true, the prices in csv files are supposed to be reported per minute. Otherwise they are expected to be stored daily.
csvdir_env is the name of the environment variable holding the csv directory. It can be set, for instance, while ingesting price data:
```
YAHOO_CSVDIR=/path/to/csvdir zipline ingest -b yahoo_csv
```
Zipline then searches for csv files inside /path/to/csvdir. The data bundle extracts the asset names from the file names by stripping the csv extension. For example, it considers AAPL.csv to store price data of Apple stock.
csvdir is the default csv directory that is used in case the environment variable is not set to a valid csv directory.
index_column is the name of the column in the csv files that stores date and time information. The bundle reads csv files into pandas.DataFrame objects with the index set to the given column.
column_mapper is a dictionary used for renaming data columns to comply with the OHLCV format expected by zipline. As said earlier, price data are stored in dataframe objects, whose columns are identical to the corresponding columns in the csv files. Renaming is necessary if the csv files do not respect the OHLCV format.
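To make the effect of index_column and column_mapper concrete, here is a small sketch. It deliberately avoids pandas and uses only the standard library so that it runs anywhere; the bundle itself works on pandas.DataFrame objects, where the corresponding renaming operation is `DataFrame.rename(columns=...)`. The helper `read_renamed` and the sample rows are made up for illustration.

```python
import csv
import io

COLUMN_MAPPER = {
    "Open": "open",
    "High": "high",
    "Low": "low",
    "Close": "close",
    "Volume": "volume",
    "Adj Close": "price",
}

# Two made-up rows in the column layout of a yahoo finance export.
SAMPLE_CSV = """Date,Open,High,Low,Close,Adj Close,Volume
2020-01-02,10.0,11.0,9.5,10.5,10.4,1000
2020-01-03,10.5,12.0,10.0,11.5,11.4,1500
"""


def read_renamed(stream, index_column, column_mapper):
    """Return {index value: row} with the columns renamed through column_mapper."""
    rows = {}
    for record in csv.DictReader(stream):
        key = record.pop(index_column)
        # Columns missing from the mapper keep their original name.
        rows[key] = {column_mapper.get(name, name): float(value)
                     for name, value in record.items()}
    return rows


prices = read_renamed(io.StringIO(SAMPLE_CSV), "Date", COLUMN_MAPPER)
print(sorted(prices["2020-01-02"]))
```

After the renaming, each row is keyed by date and carries only lower-case OHLCV column names plus price.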

There are two other parameters passed to the register function: 'yahoo_csv' is the bundle name and calendar_name is the trading calendar on which the dates and times of the prices are based.

New direct data bundle

This type of bundle directly downloads price data via the API provided by a data provider. The downloader function is responsible for fetching price data and delivering it to the ingester. The ingester then feeds the data into zipline's internal database. Thus, the main step is to define the downloader. Similar to a csv ingester, a direct ingester needs to be registered before it can be used by zipline. As an example, I explain, step by step, how a bundle capable of fetching data from IEX cloud can be defined and registered.

First, a downloader function is required that downloads price data via the IEX cloud API. The ingester invokes the downloader with the appropriate parameters, so the downloader must have a specific signature.

from pandas import Timestamp
from iexfinance.stocks import get_historical_data

def get_downloader(start_date, end_date):
    """returns a downloader closure for iex cloud
    :param start_date: the first day on which data are downloaded
    :param end_date: the last day on which data are downloaded
    :type start_date: str in format YYYY-MM-DD
    :type end_date: str in format YYYY-MM-DD
    """
    dt_start = Timestamp(start_date).date()
    dt_end = Timestamp(end_date).date()

    def downloader(symbol):
        """downloads symbol price data using iex cloud API
        :param symbol: the symbol name
        :type symbol: str
        """
        # fetch the price history of the symbol as a pandas.DataFrame
        df = get_historical_data(symbol, dt_start, dt_end, output_format='pandas')

        return df

    return downloader

The downloader is generated by the function get_downloader as a closure. This function takes the date interval within which price data are downloaded via the arguments start_date and end_date. The downloader takes the symbol name as its argument and fetches price data by calling get_historical_data. get_historical_data, provided by the iexfinance package, handles the relevant REST API calls to fetch the data and returns them as a pandas.DataFrame. The return value must in addition comply with the OHLCV format. Assuming the above code block is stored as iex.py within the zipline.data.bundles package, the next step is to register a new data bundle, which uses the downloader to fetch price data from IEX cloud. The registration is done within extension.py.

from zipline.data.bundles import register
from zipline.data.bundles.ingester import direct_ingester
from zipline.data.bundles import iex
register(
    'iex',  # bundle's name
    direct_ingester(
        'IEX Cloud',
        every_min_bar=False,
        # the environment variable holding the comma separated list of asset names
        symbol_list_env='IEX_SYM_LST',
        downloader=iex.get_downloader(
            start_date='2015-01-01',
            end_date='2020-01-01',
        ),
    ),
    calendar_name='NYSE',
)

The bundle is called iex and, like yahoo_csv, uses the NYSE trading calendar. The ingest function is defined by creating an object of type direct_ingester, which is a functor. The parameters are as follows:

'IEX Cloud' is an arbitrary name for the exchange providing data.
every_min_bar indicates the price frequency. When it is true, the prices are supposed to be reported per minute. Otherwise they are daily prices.
symbol_list_env is the name of the environment variable holding a comma separated list of asset names. It can be set, for instance, while ingesting price data:
```
IEX_SYM_LST=SPY,AAPL,TWTR zipline ingest -b iex
```
Zipline then downloads data for assets SPY, AAPL and TWTR.
downloader is the downloader function, which in this case is given by iex.get_downloader, defined above. Price data are downloaded between the given start_date and end_date.
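Because the downloader is a closure over the date interval, the pattern can be tried out offline, without network access or an API token. The sketch below is such a dry run and not the repository's code: `make_downloader`, its `fetch` parameter and `fake_fetch` are hypothetical names, with `fetch` playing the role that get_historical_data plays in the IEX example.

```python
from datetime import date


def make_downloader(start_date, end_date, fetch):
    """Return a downloader closure that remembers the date interval.

    fetch is a hypothetical callable (symbol, start, end) -> data.
    """
    dt_start = date.fromisoformat(start_date)
    dt_end = date.fromisoformat(end_date)
    if dt_start > dt_end:
        raise ValueError("start_date must not be after end_date")

    def downloader(symbol):
        # Only the symbol varies per call; the dates come from the closure.
        return fetch(symbol, dt_start, dt_end)

    return downloader


calls = []


def fake_fetch(symbol, start, end):
    """Offline stand-in for a real data provider: it only records the request."""
    calls.append((symbol, start.isoformat(), end.isoformat()))
    return []  # a real fetcher would return price data here


download = make_downloader("2015-01-01", "2020-01-01", fake_fetch)
for name in ("SPY", "AAPL"):
    download(name)
print(calls)
```

Passing the fetcher in as an argument keeps the date handling testable on its own; swapping `fake_fetch` for a real provider call would leave the closure unchanged.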

Conclusion

This helper aims to simplify the process of defining new data bundles, regardless of whether the data are read from csv files or downloaded directly over the network. New data bundles can be added by customizing the generic ingester. The user can focus solely on data retrieval and filtering and leave the other tasks to the helper module.