Dataset

In EpiLearn, we use UniversalDataset to load preprocessed datasets. For customized data, we can simply initialize the UniversalDataset given features, graphs, and states.

UniversalDataset

class epilearn.data.dataset.UniversalDataset(name=None, root='./', x=None, states=None, y=None, graph=None, dynamic_graph=None, edge_index=None, edge_weight=None, edge_attr=None)

The UniversalDataset class is designed to handle various types of graph data, enabling operations on datasets that include features, states, dynamic graphs, and edge attributes.

Parameters:
  • name (str, optional) – Name of the dataset to be loaded (supported datasets only).

  • root (str, optional) – Directory where the dataset is downloaded (supported datasets only).

  • x (torch.Tensor, optional) – Node features tensor of shape (num_samples, num_nodes, num_features). Represents the node features over multiple timesteps.

  • states (torch.Tensor, optional) – Tensor representing various states of nodes, similar in structure to node features.

  • y (torch.Tensor, optional) – Tensor representing target labels or values for each node, structured similar to node features.

  • graph (torch.Tensor or scipy.sparse matrix, optional) – Static graph structure as an adjacency matrix.

  • dynamic_graph (torch.Tensor, optional) – Dynamic graph information over time, providing evolving adjacency matrices.

  • edge_index (torch.LongTensor, optional) – Tensor containing edge indices, typically of shape (2, num_edges), for defining which nodes are connected.

  • edge_weight (torch.Tensor, optional) – Edge weights corresponding to the edge_index, providing the strength or capacity of connections.

  • edge_attr (torch.Tensor, optional) – Attributes or features for each edge, aligned with the structure defined in edge_index.
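The shape conventions above can be checked before a dataset is built. The following is a minimal sketch, not part of EpiLearn: check_shapes is a hypothetical helper that works on plain shape tuples, and it assumes that dynamic_graph shares its first dimension with x and that graph is square over the nodes.

```python
def check_shapes(x, graph=None, dynamic_graph=None, edge_index=None):
    """Return a list of shape problems; an empty list means the shapes agree.

    Every argument is a shape tuple, e.g. tuple(tensor.shape).
    Hypothetical helper for illustration only.
    """
    if len(x) != 3:
        return ["x should be (num_samples, num_nodes, num_features)"]
    num_steps, num_nodes, _ = x
    problems = []
    if graph is not None and tuple(graph) != (num_nodes, num_nodes):
        problems.append("graph should be (num_nodes, num_nodes)")
    if dynamic_graph is not None and tuple(dynamic_graph) != (num_steps, num_nodes, num_nodes):
        problems.append("dynamic_graph should be (num_steps, num_nodes, num_nodes)")
    if edge_index is not None and (len(edge_index) != 2 or edge_index[0] != 2):
        problems.append("edge_index should be (2, num_edges)")
    return problems


# Consistent shapes: 539 time steps, 47 nodes, 4 features
ok = check_shapes((539, 47, 4), graph=(47, 47), dynamic_graph=(539, 47, 47))
# Inconsistent static graph
bad = check_shapes((539, 47, 4), graph=(47, 46))
```

With real tensors, the same check could be called as check_shapes(tuple(x.shape), graph=tuple(graph.shape)).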

classmethod from_csv(feature_csv: str, node_id_col: str, time_col: str, feature_cols: list, target_cols: list | None = None, edge_csv: str | None = None, source_col: str = 'source', target_col: str = 'target', strict_numeric: bool = True)

Load dataset from CSV files and build a UniversalDataset without changing existing behaviors.

Parameters:
  • feature_csv – Path to the CSV containing time series features/targets. Must include time_col, node_id_col, feature_cols, and optionally target_cols.

  • node_id_col – Column name for node identifiers.

  • time_col – Column name for timestamps (sorted ascending).

  • feature_cols – List of feature column names (numeric).

  • target_cols – Optional list of target column names (numeric).

  • edge_csv – Optional path to an edges CSV with two columns: source_col, target_col.

  • source_col – Name of the source-node column in the edges CSV.

  • target_col – Name of the target-node column in the edges CSV.

  • strict_numeric – If True, raise on any non-numeric entries in features/targets. If False, warn and keep NaNs.

Returns:

UniversalDataset(x=[T,N,F], y=[T,N] or [T,N,Ty], graph=[N,N], edge_index=[2,E])
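To make the expected CSV layout concrete, here is a minimal sketch of how a long-format feature table (one row per time step and node) maps onto a [T, N, F] array. It uses only the standard library and is not EpiLearn's implementation; the column names date, region, cases, and deaths, the helper pivot_long_csv, and the sorted ordering of nodes are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical feature CSV: one row per (time step, node)
feature_csv = io.StringIO(
    "date,region,cases,deaths\n"
    "2020-01-01,A,1,0\n"
    "2020-01-01,B,2,1\n"
    "2020-01-02,A,3,0\n"
    "2020-01-02,B,4,2\n"
)


def pivot_long_csv(handle, node_id_col, time_col, feature_cols):
    """Arrange long-format rows into nested lists indexed [time][node][feature]."""
    rows = list(csv.DictReader(handle))
    times = sorted({row[time_col] for row in rows})    # ascending timestamps
    nodes = sorted({row[node_id_col] for row in rows})  # assumed node ordering
    t_index = {t: i for i, t in enumerate(times)}
    n_index = {n: i for i, n in enumerate(nodes)}
    x = [[[0.0] * len(feature_cols) for _ in nodes] for _ in times]
    for row in rows:
        values = [float(row[col]) for col in feature_cols]
        x[t_index[row[time_col]]][n_index[row[node_id_col]]] = values
    return x, times, nodes


x, times, nodes = pivot_long_csv(feature_csv, "region", "date", ["cases", "deaths"])
# x has T=2 time steps, N=2 nodes, F=2 features
```

A file in this layout would correspond to node_id_col='region', time_col='date', and feature_cols=['cases', 'deaths'] in the signature above.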

generate_dataset(X=None, Y=None, states=None, dynamic_adj=None, lookback_window_size=1, horizon_size=1, ahead=0, permute=False)

Takes node features for the graph and divides them into multiple samples along the time axis by sliding a window of size (num_timesteps_input + num_timesteps_output) across it in steps of 1.

Parameters:
  • X – Node features of shape (num_vertices, num_features, num_timesteps).

Returns:
  • Node features divided into multiple samples. Shape is (num_samples, num_vertices, num_features, num_timesteps_input).

  • Node targets for the samples. Shape is (num_samples, num_vertices, num_features, num_timesteps_output).
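The sliding-window idea can be sketched on a plain list. This is an illustration only, not the library's code: sliding_windows is a hypothetical helper, it assumes lookback and horizon play the roles of lookback_window_size and horizon_size, and it does not model the ahead or permute arguments.

```python
def sliding_windows(series, lookback, horizon):
    """Split a time-ordered sequence into (inputs, targets) pairs with stride 1."""
    window = lookback + horizon
    samples = []
    # A series of length T yields T - window + 1 samples
    for start in range(len(series) - window + 1):
        inputs = series[start:start + lookback]
        targets = series[start + lookback:start + window]
        samples.append((inputs, targets))
    return samples


series = list(range(6))  # six time steps: 0, 1, ..., 5
pairs = sliding_windows(series, lookback=3, horizon=2)
```

Here six time steps with a window of 3 + 2 produce two samples: the first uses steps 0 to 2 as input and steps 3 to 4 as target, and the second is shifted by one step.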

classmethod load_from_csv(feature_csv: str, node_id_col: str, time_col: str, feature_cols: list, target_cols: list | None = None, edge_csv: str | None = None, source_col: str = 'source', target_col: str = 'target', strict_numeric: bool = True)

Load dataset from CSV files and build a UniversalDataset without changing existing behaviors.

Parameters:
  • feature_csv – Path to the CSV containing time series features/targets. Must include time_col, node_id_col, feature_cols, and optionally target_cols.

  • node_id_col – Column name for node identifiers.

  • time_col – Column name for timestamps (sorted ascending).

  • feature_cols – List of feature column names (numeric).

  • target_cols – Optional list of target column names (numeric).

  • edge_csv – Optional path to an edges CSV with two columns: source_col, target_col.

  • source_col – Name of the source-node column in the edges CSV.

  • target_col – Name of the target-node column in the edges CSV.

  • strict_numeric – If True, raise on any non-numeric entries in features/targets. If False, warn and keep NaNs.

Returns:

UniversalDataset(x=[T,N,F], y=[T,N] or [T,N,Ty], graph=[N,N], edge_index=[2,E])
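The relationship between an edges CSV, an edge_index of shape [2, E], and an [N, N] adjacency matrix can be sketched as follows. This is a standard-library illustration under stated assumptions, not EpiLearn's implementation: edges_to_graph is a hypothetical helper, edges are treated as directed and unweighted, and node indices follow the order of the nodes list.

```python
import csv
import io

# Hypothetical edges CSV using the default column names 'source' and 'target'
edge_csv = io.StringIO("source,target\nA,B\nB,C\n")
node_ids = ["A", "B", "C"]


def edges_to_graph(handle, nodes, source_col="source", target_col="target"):
    """Build an edge index ([2, E]) and a dense adjacency matrix ([N, N])."""
    index = {node: i for i, node in enumerate(nodes)}
    edge_index = [[], []]
    adjacency = [[0] * len(nodes) for _ in nodes]
    for row in csv.DictReader(handle):
        src, dst = index[row[source_col]], index[row[target_col]]
        edge_index[0].append(src)
        edge_index[1].append(dst)
        adjacency[src][dst] = 1  # directed edge; mirror it for an undirected graph
    return edge_index, adjacency


edge_index, adjacency = edges_to_graph(edge_csv, node_ids)
```

Both outputs describe the same two edges, A to B and B to C; whether the library symmetrizes the graph is not stated in this page.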

Preprocessed Datasets

We collect epidemic data from various sources, including the following:

Temporal Data

  • Tycho_v1.0.0: Includes eight diseases, with data collected across 50 US states and 122 US cities from 1916 to 2009.

  • Measles: Contains measles infections in England and Wales across 954 urban centers (cities and towns) from 1944 to 1964.

Spatial & Temporal Data

  • Covid_static: Contains COVID-19 infections with a static graph. [1]

  • Covid_dynamic: Contains COVID-19 infections with a dynamic graph. [2] [3]

Dataset Loading

Loading the Measles and Tycho datasets:

from epilearn.data import UniversalDataset

tycho_dataset = UniversalDataset(name='Tycho_v1', root='./tmp/')

measle_dataset = UniversalDataset(name='Measles', root='./tmp/')

For COVID-19 data, we support the dataset from Johns Hopkins University:

from epilearn.data import UniversalDataset

jhu_dataset = UniversalDataset(name='JHU_covid', root='./tmp/')

For other countries, pass 'Covid_' followed by the country name to load the corresponding COVID-19 dataset. Currently supported countries are China, Brazil, Austria, England, France, Italy, Newzealand, and Spain.

from epilearn.data import UniversalDataset

covid_dataset = UniversalDataset(name='Covid_Brazil', root='./tmp/')
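The naming rule can be wrapped in a small helper that fails early on an unsupported country. This is a sketch only: covid_dataset_name is a hypothetical function, not part of EpiLearn, and it assumes the dataset names use exactly the country spellings listed above.

```python
# Country spellings as listed in this page (assumed to match the dataset names)
SUPPORTED_COUNTRIES = [
    "China", "Brazil", "Austria", "England",
    "France", "Italy", "Newzealand", "Spain",
]


def covid_dataset_name(country):
    """Return the dataset name for a supported country, e.g. 'Covid_Brazil'."""
    if country not in SUPPORTED_COUNTRIES:
        raise ValueError(f"Unsupported country: {country!r}")
    return "Covid_" + country


name = covid_dataset_name("Brazil")
```

The resulting string would then be passed as the name argument, as in UniversalDataset(name=name, root='./tmp/').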

Customize Your Own Dataset

First, organize your data as a dictionary with the keys features, graph, dynamic_graph, targets, and states. Here is an example:

import torch

data = torch.load("example.pt")

data.keys()
# dict_keys(['features', 'graph', 'dynamic_graph', 'targets', 'states'])
node_features = data['features']    # [time steps, nodes, channels]: torch.Size([539, 47, 4])

static_graph = torch.Tensor(data['graph'])  # [nodes, nodes]: (47, 47)

dynamic_graph = data['dynamic_graph']   # [time steps, nodes, nodes]: torch.Size([539, 47, 47])

targets = data['targets']   # [time steps, nodes]: torch.Size([539, 47])

node_status = data['states']    # [time steps, nodes]: torch.Size([539, 47])

Next, you can build a UniversalDataset from your own data by passing the corresponding parameters according to your needs. Not every parameter is required. You can refer to UniversalDataset for detailed descriptions and customize your parameters.

from epilearn.data import UniversalDataset

dataset_sample1 = UniversalDataset(x=node_features,
                        states=node_status,  # additional per-node information, e.g. SIR states
                        y=targets,  # prediction target
                        graph=static_graph,  # static adjacency matrix; an edge index is also supported: edge_index=...
                        dynamic_graph=dynamic_graph  # time-varying adjacency matrices
                        )

# Only the arguments you need are required, e.g. features, targets, and a static graph
dataset_sample2 = UniversalDataset(x=node_features, y=targets, graph=static_graph)

For more sample code from a real training process, refer to examples/dataset_customization.ipynb on the GitHub page.