Welcome to nowcast’s documentation!

class nowcast.data_config.TSConfig

Register and preprocess time series from arbitrary datasets.

This simplifies the process of combining datasets from different domains. The data is unified into a single configuration object which is then handled directly by time series models.

Main features:
  • Register each dataset so that TSConfig can automatically merge on timestamp. Arbitrary combinations of datasets/variables can be specified for modeling.

  • Conveniently add autoregressive (lag) terms for any variable from any dataset, extended into the future as far as the data allows. Unlike the pandas shift method, which by default keeps the original index so that lagged values are cut off at the last date, this allows prediction of future values (i.e. forecasting).

  • Simulate information lag at the variable level. In other words, rows can be shifted so that the prediction for each timestamp uses only the information that would have been available at forecast time.
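The idea of extending lag terms past the last observed date can be sketched in plain Python. This is only an illustration of the concept, not how TSConfig implements it: the helper name lag_with_extension, the dict-based series, and the weekly step are all assumptions made for this example.

```python
from datetime import date, timedelta


def lag_with_extension(series, lag, step=timedelta(weeks=1)):
    """Shift a {timestamp: value} series forward by `lag` periods.

    The value observed at time t becomes the lag feature at t + lag * step,
    so the result reaches `lag` periods past the last observed timestamp.
    (Hypothetical helper, not part of nowcast.)
    """
    return {t + lag * step: v for t, v in series.items()}


# Four weekly observations of a made-up target variable.
weeks = [date(2019, 1, 7) + timedelta(weeks=i) for i in range(4)]
ili = dict(zip(weeks, [1.0, 1.5, 2.0, 2.5]))

# The lag-1 feature now exists for the week after the last observation,
# which is what makes a one-week-ahead prediction possible.
lag1 = lag_with_extension(ili, lag=1)
```

Because the lagged series carries its own shifted timestamps, no value is dropped at the end of the data, in contrast to shifting values within a fixed index.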

Example

Suppose we have a target variable in the dataframe cdc, and a predictor dataset in the dataframe external. First register the data:

$ from nowcast.data_config import TSConfig
$ dc = TSConfig()
$ dc.register_target(cdc, 'Date', 'CDC')
$ dc.register_dataset(external, 'pred', 'Date', 'predictor')

Add lag terms of the target variable as autoregressive predictors:

$ dc.add_AR(range(1, 7), dataset='CDC', var_names='%ILI')

Call the stack method to combine the datasets:

$ dc.stack()

The combined dataframes can then be accessed as an (X, y) tuple:

$ dc.data

add_AR(terms, dataset, var_names='all')

Create an autoregressive (lagged) dataset as additional features.

These are stored using the name ‘AR_{dataset}’.

Parameters:
  • terms (list) – lag terms to consider, e.g. 1 means 1 period ago.

  • dataset (str) – Dataset containing the variables to be lagged

  • var_names (str or list) – variables to lag. ‘all’ to select all.

register_dataset(data, name, time_var, var_names=None, copy=True)

Register a predictor dataset with the TSConfig.

Parameters:
  • data (pd.DataFrame) – Must contain at least 2 columns: a timestamp column, and a variable column.

  • name (str) – Name to associate with the dataset

  • time_var (str) – Name of the timestamp column (will be renamed to Timestamp)

  • var_names (list) – Optional, list of columns to keep

  • copy (bool) – Make a copy of data before modifying. If False, the original dataframe is modified in place (recommended only when memory is a constraint)

register_target(data, time_var, target_var=None, copy=True)

Register a target variable with the TSConfig.

Parameters:
  • data (pd.DataFrame) – Must contain at least 2 columns, a timestamp column, and a variable column.

  • time_var (str) – Name of the timestamp column (will be renamed to Timestamp)

  • target_var (str) – Name of the target variable column (if None, is inferred)

  • copy (bool) – Make a copy of data before modifying. If False, the original dataframe is modified in place (recommended only when memory is a constraint)

set_delay(periods, datasets='all')

Specify information delays by dataset.

This simulates delays in receiving a data source. Here, one can specify which datasets are delayed.

Caution

Consider whether AR lags of the target variable should also be delayed in your situation. Setting datasets='all' delays these lags as well, which may not be realistic when modeling knowledge delays. This setting is similar to forecasting, with one important difference: the training set runs up to the present, whereas in forecasting the training set is limited to the most recent available target value.

Because delays interact with AR terms, set_delay must be called before add_AR.

Parameters:
  • periods (int) – number of time intervals of delay

  • datasets (str or list) – list of datasets to apply delay or ‘all’ to delay all including AR lags of the target variable.
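As a rough illustration of what an information delay means, the sketch below shifts a column of values within a fixed set of timestamps, so that each row holds only what would have been received by then. The function name delay and the list-based representation are assumptions for this example; the actual set_delay implementation may differ.

```python
def delay(values, periods):
    """Return `values` as seen with `periods` intervals of reporting delay.

    Row i of the result holds the value originally at row i - periods;
    the first `periods` rows have nothing available yet and become None.
    (Hypothetical helper, not part of nowcast.)
    """
    if periods < 0:
        raise ValueError("periods must be non-negative")
    kept = max(len(values) - periods, 0)
    return [None] * (len(values) - kept) + list(values[:kept])


# With a one-period delay, the value 40 has not arrived by the last row.
delayed = delay([10, 20, 30, 40], periods=1)
```

Unlike a lag term that extends into the future, the delayed column keeps the original row count, and the most recent observations are simply not yet visible.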

stack(predictors='all', merge_type='outer', fill_na='ignore')

Merge datasets together into final modeling dataframe.

Parameters:
  • predictors ('all' or list) – All predictor datasets to use

  • merge_type (str) – pandas merge option. In general use ‘outer’.

  • fill_na (str) – How to handle missing data. Keep as ‘ignore’ for now
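The effect of an outer merge on timestamp can be sketched without pandas. This is a simplified illustration under stated assumptions: datasets are plain {timestamp: value} dicts, the function name outer_merge is made up for this example, and missing values are left as None (loosely mirroring the 'ignore' behaviour described above).

```python
def outer_merge(datasets):
    """Outer-join {name: {timestamp: value}} datasets on timestamp.

    Every timestamp that appears in any dataset yields one row; a dataset
    with no value at that timestamp contributes None.
    (Hypothetical helper, not part of nowcast.)
    """
    timestamps = sorted(set().union(*(d.keys() for d in datasets.values())))
    return [
        {"Timestamp": t, **{name: d.get(t) for name, d in datasets.items()}}
        for t in timestamps
    ]


# ISO date strings sort chronologically, so they work as simple timestamps.
rows = outer_merge({
    "CDC": {"2019-01-07": 1.0, "2019-01-14": 1.5},
    "pred": {"2019-01-14": 5.0, "2019-01-21": 6.0},
})
```

An outer merge keeps rows where only predictors exist (here 2019-01-21), which is what allows predictions for dates where the target has not been observed yet.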

class nowcast.arex.Arex(model, X=None, y=None, data_config=None, verbose=1)

AREX (AutoRegression with EXogeneity) is an iterative time series predictor that abstracts away the logic of retraining a model sequentially on time series data. AREX does not impose any modeling constraints – instead it is a procedure that can handle any model that is compatible with scikit-learn’s fit/predict API.

Usually, one retrains a time series model at each time step in order to use the most recent information. The training set at each step can be either rolling (fixed size that discards old data), or expanding (use all data). Often, a time series is predicted using a combination of lags of the time series (AR), concurrent exogenous variables (EX), and lags of the exogenous variables. AREX takes care of these details for you.

On the other hand, the actual model that is applied at each time step is highly important to researchers: it can involve anything from preprocessing and feature engineering to various ML algorithms and hyperparameter tuning. This part is therefore flexible and limited only by your creativity. All AREX needs is a model class with fit() and predict() methods, identical to sklearn.

In fact, any sklearn model can be passed directly into AREX to get an out-of-the-box time series modeler.
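The retraining procedure described above can be sketched as a simple walk-forward loop. This is a minimal illustration of the general technique with a rolling window, not the Arex implementation: walk_forward, LastValueModel, and the list-based data are all assumptions made for this example.

```python
class LastValueModel:
    """Toy stand-in for a scikit-learn style model (hypothetical).

    fit() remembers the last target value; predict() repeats it per row.
    """

    def fit(self, X, y):
        self.last_ = y[-1]
        return self

    def predict(self, X):
        return [self.last_ for _ in X]


def walk_forward(model, X, y, first, window):
    """Refit on the `window` rows before each row i >= first, then predict row i.

    Each prediction for row i uses X[i] with a model trained only on
    earlier rows, i.e. a rolling-window nowcast.
    """
    if first < window:
        raise ValueError("need at least `window` rows before the first prediction")
    preds = []
    for i in range(first, len(X)):
        model.fit(X[i - window:i], y[i - window:i])
        preds.append(model.predict(X[i:i + 1])[0])
    return preds


X = [[0], [1], [2], [3], [4]]
y = [10, 11, 12, 13, 14]
preds = walk_forward(LastValueModel(), X, y, first=3, window=2)
```

Any object with the same fit/predict shape could be dropped in for LastValueModel, which is the flexibility the text describes.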

Example

Suppose we continue the example with TSConfig:

$ dc = TSConfig()
$ dc.register_target(cdc, 'Date', 'CDC')
$ dc.register_dataset(external, 'pred', 'Date', 'predictor')
$ dc.add_AR(range(1, 7), dataset='CDC', var_names='%ILI')
$ dc.stack()

We will use a default sklearn random forest as the model:

$ from sklearn.ensemble import RandomForestRegressor
$ mod = RandomForestRegressor()

The above time series is at a weekly frequency. For nowcasting (predicting target at week t using exogenous data from week t) with a year-long rolling training window, do:

$ from nowcast.arex import Arex
$ arex = Arex(model=mod, data_config=dc)
$ pred = arex.nowcast(pred_start='2019-02-19',
                      pred_end='2019-08-20',
                      training='roll', window=52)

Suppose we want to predict a week ahead. We would run:

$ pred2 = arex.forecast(t_plus=1, pred_start='2019-02-19',
                        pred_end='2019-08-20',
                        training='roll', window=52)

Note that the timestamps for pred_start and pred_end refer to the time of making the prediction, not the time that is predicted.

The returned prediction dataframe will include a column called “Timestamp”, which is the time of making the prediction, and a column called “Event Time”, which is the time of the predicted event.
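The relationship between the two columns can be illustrated with a small calculation. This sketch assumes weekly data and that the event time is simply the prediction time plus t_plus periods; the function name event_time is made up for this example.

```python
from datetime import date, timedelta


def event_time(timestamp, t_plus, period=timedelta(weeks=1)):
    """Time of the predicted event for a prediction made at `timestamp`.

    Assumes evenly spaced data with spacing `period` (hypothetical helper).
    """
    return timestamp + t_plus * period


# A one-week-ahead forecast made on 2019-02-19 targets 2019-02-26;
# a nowcast (t_plus=0) targets the prediction date itself.
forecast_target = event_time(date(2019, 2, 19), t_plus=1)
nowcast_target = event_time(date(2019, 2, 19), t_plus=0)
```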

Parameters:
  • model (class) – Any model with fit and predict methods following sklearn API. The methods must accept pandas dataframes.

  • X (dataframe) – Predictor dataframe. Either pass dataframes for both X and y, or otherwise pass a TSConfig object to data_config.

  • y (dataframe) – Target dataframe with a single column with the target measurements, indexed by timestamp

  • data_config (TSConfig) – TSConfig object containing preprocessed X and y data. If passed, this overrides X and y.

  • verbose (int) – Control verbosity of output

forecast(t_plus, pred_start, pred_end, training, window, pred_name='Predicted', t_known=False)

Perform rolling forecast on data.

Parameters:
  • t_plus (int) – Number of periods ahead to forecast. 0 corresponds to nowcasting (predict y_t from X_t).

  • pred_start (str or datetime) – Time of first prediction

  • pred_end (str or datetime) – Time of last prediction

  • training (str) – Training window behavior, either ‘roll’ or ‘expand’

  • window (int) – Training window size. If training is ‘expand’, then window determines the size for the first prediction, and subsequent windows increase in size.

  • pred_name (str) – Name for prediction output column

  • t_known (bool) – Whether y_t would be known at time of forecasting. This adjusts the training window to use the most recent information. Since y_t is the nowcast target, t_known must be set to False when t_plus=0.
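The difference between the 'roll' and 'expand' settings can be sketched as index arithmetic. This is an illustration only, assuming the first prediction is trained on rows 0 to window - 1 and that each later prediction moves forward by one row; training_range is a hypothetical helper, and the indexing Arex uses internally (including the t_known adjustment) may differ.

```python
def training_range(step, window, training):
    """Half-open [start, stop) row range used to train for prediction `step`.

    step 0 is the first prediction. With 'roll' the window keeps a fixed
    size and slides forward; with 'expand' it grows by one row per step.
    (Hypothetical helper, not part of nowcast.)
    """
    if training not in ("roll", "expand"):
        raise ValueError("training must be 'roll' or 'expand'")
    stop = window + step
    start = step if training == "roll" else 0
    return start, stop


# Year-long weekly window, fourth prediction (step 3).
rolling = training_range(3, window=52, training="roll")
expanding = training_range(3, window=52, training="expand")
```

Here the rolling window still spans 52 rows, while the expanding window has grown to 55.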

get_log()

Return a dataframe of metadata for each prediction. For example, the X_train and y_train date ranges, prediction time, and training set sizes. Use this for debugging or to confirm behavior.

get_params()

Return the user-set parameters of the Arex object.

nowcast(pred_start, pred_end, training, window, pred_name='Predicted')

Perform rolling nowcast on data. This is just a convenience wrapper for the forecast method.

Examples of ARGO models corresponding to previous research:

These examples can be used out-of-the-box as an alternative to standard sklearn models. They demonstrate the variety of options that can be defined in an iterated time series model, including data preprocessing, transformations, and cross-validation.

Models should adhere to sklearn API as shown in the examples.
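As a rough sketch of what such a model class can look like, the example below wraps a preprocessing step (a log transform of the target) inside fit and undoes it in predict. LogMeanModel is a made-up illustration, not one of the ARGO classes, and it works on plain Python sequences; a model intended for Arex would need to accept pandas dataframes.

```python
import math


class LogMeanModel:
    """Hypothetical model that fits in log space and back-transforms.

    Follows the scikit-learn convention: fit() returns self and
    predict() returns one prediction per row of X.
    """

    def fit(self, X, y):
        if any(v <= 0 for v in y):
            raise ValueError("log transform requires positive targets")
        logs = [math.log(v) for v in y]  # preprocessing: log-transform target
        self.log_mean_ = sum(logs) / len(logs)
        return self

    def predict(self, X):
        # Back-transform to the original scale (the geometric mean of y).
        return [math.exp(self.log_mean_)] * len(X)


model = LogMeanModel().fit([[0], [0]], [1.0, 100.0])
predictions = model.predict([[0]])
```

Keeping the transformation inside the model means it is re-applied at every retraining step without any extra handling by the caller.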

class nowcast.models.argo_models.Argo

The base ARGO model without extra features, with Lasso regression

class nowcast.models.argo_models.Argo2015

The ARGO model based on Yang et al. 2015

Currently not configured to log-transform Google Trends data. It takes the logit of all inputs.
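For reference, the logit transform maps a proportion p in (0, 1) to log(p / (1 - p)). The sketch below shows the transform and its inverse; it assumes inputs are expressed as proportions rather than percentages, and does not reflect how Argo2015 scales its data internally.

```python
import math


def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def inverse_logit(x):
    """Map a log-odds value back to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


# Round trip for a 2% rate expressed as a proportion.
recovered = inverse_logit(logit(0.02))
```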

class nowcast.models.argo_models.ArgoSVM

ARGO with SVM regressor, based on Santillana et al. 2016

Data loaders for various datasets used by ARGO:

Specifically: CDC ILINet data, athenahealth, and Google Trends

class nowcast.datasets.data_loaders.AthenaLoader(filename, smoothing=None)

Loader for athenahealth data.

Example usage:

$ athl = AthenaLoader("./ATHdata.csv")
$ ath = athl.load_national()

class nowcast.datasets.data_loaders.CDCLoader(filename, ili_version='weighted')

Loader for CDC data.

Example usage:

$ cdcl = CDCLoader("./ILI_national_dated.csv")
$ cdc = cdcl.load_national()

nowcast.datasets.data_loaders.gt_loader(filename)

Loader for Google Trends data
