Dataset
In EpiLearn, we use UniversalDataset to load preprocessed datasets. For customized data, we can simply initialize the UniversalDataset given features, graphs, and states.
UniversalDataset
- class epilearn.data.dataset.UniversalDataset(name=None, root='./', x=None, states=None, y=None, graph=None, dynamic_graph=None, edge_index=None, edge_weight=None, edge_attr=None)
The UniversalDataset class is designed to handle various types of graph data, enabling operations on datasets that include features, states, dynamic graphs, and edge attributes.
- Parameters:
name (str, optional) – Name of the dataset to be loaded (supported datasets only).
root (str, optional) – Location where the dataset is downloaded (supported datasets only).
x (torch.Tensor, optional) – Node features tensor of shape (num_samples, num_nodes, num_features), where the first dimension indexes timesteps. Represents the node features over multiple timesteps.
states (torch.Tensor, optional) – Tensor representing various states of nodes, similar in structure to node features.
y (torch.Tensor, optional) – Tensor representing target labels or values for each node, structured similar to node features.
graph (torch.Tensor or scipy.sparse matrix, optional) – Static graph structure as an adjacency matrix.
dynamic_graph (torch.Tensor, optional) – Dynamic graph information over time, providing evolving adjacency matrices.
edge_index (torch.LongTensor, optional) – Tensor containing edge indices, typically of shape (2, num_edges), for defining which nodes are connected.
edge_weight (torch.Tensor, optional) – Edge weights corresponding to the edge_index, providing the strength or capacity of connections.
edge_attr (torch.Tensor, optional) – Attributes or features for each edge, aligned with the structure defined in edge_index.
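The relationship between graph and edge_index/edge_weight can be illustrated without any library. The following is a minimal sketch, not EpiLearn's own conversion code: the helper name adjacency_to_edges is hypothetical, and it assumes that a nonzero entry adj[i][j] denotes an edge from node i to node j.

```python
def adjacency_to_edges(adj):
    """Convert a dense adjacency matrix (list of rows) into edge lists.

    Returns (edge_index, edge_weight), where edge_index is a pair of lists
    [sources, targets] (the (2, num_edges) layout) and edge_weight holds
    the matching nonzero entries.
    """
    sources, targets, weights = [], [], []
    for i, row in enumerate(adj):
        for j, w in enumerate(row):
            if w != 0:  # assumption: zero means "no edge"
                sources.append(i)
                targets.append(j)
                weights.append(w)
    return [sources, targets], weights


adj = [
    [0.0, 0.5, 0.0],
    [0.5, 0.0, 2.0],
    [0.0, 2.0, 0.0],
]
edge_index, edge_weight = adjacency_to_edges(adj)
print(edge_index)   # [[0, 1, 1, 2], [1, 0, 2, 1]]
print(edge_weight)  # [0.5, 0.5, 2.0, 2.0]
```

Wrapping the result as torch.tensor(edge_index, dtype=torch.long) would give a tensor of shape (2, num_edges), matching the edge_index description above.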
- classmethod from_csv(feature_csv: str, node_id_col: str, time_col: str, feature_cols: list, target_cols: list | None = None, edge_csv: str | None = None, source_col: str = 'source', target_col: str = 'target', strict_numeric: bool = True)
Load dataset from CSV files and build a UniversalDataset without changing existing behaviors.
- Parameters:
feature_csv – Path to the CSV containing time series features/targets. Must include time_col, node_id_col, feature_cols, and optionally target_cols.
node_id_col – Column name for node identifiers.
time_col – Column name for timestamps (sorted ascending).
feature_cols – List of feature column names (numeric).
target_cols – Optional list of target column names (numeric).
edge_csv – Optional path to an edges CSV with two columns: source_col, target_col.
source_col – Name of the source-node column in the edges CSV.
target_col – Name of the target-node column in the edges CSV.
strict_numeric – If True, raise on any non-numeric entries in features/targets. If False, warn and keep NaNs.
- Returns:
UniversalDataset(x=[T,N,F], y=[T,N] or [T,N,Ty], graph=[N,N], edge_index=[2,E])
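To make the expected CSV layout concrete, here is a minimal sketch of how a long-format feature table (one row per timestamp and node) maps onto the [T,N,F] layout of x. It is an illustration only, not the library's parsing code: the column names date, region, cases, and deaths, the helper pivot_features, and the alphabetical node ordering are all assumptions.

```python
import csv
import io

# Hypothetical long-format feature table: one row per (time, node) pair.
FEATURE_CSV = """date,region,cases,deaths
2020-01-01,A,10,1
2020-01-01,B,20,2
2020-01-02,A,11,1
2020-01-02,B,22,3
2020-01-03,A,13,2
2020-01-03,B,25,3
"""


def pivot_features(text, node_id_col, time_col, feature_cols):
    """Pivot long-format rows into a nested list indexed as x[t][n][f]."""
    rows = list(csv.DictReader(io.StringIO(text)))
    times = sorted({r[time_col] for r in rows})     # ascending timestamps
    nodes = sorted({r[node_id_col] for r in rows})  # assumed node order
    t_idx = {t: i for i, t in enumerate(times)}
    n_idx = {n: i for i, n in enumerate(nodes)}
    x = [[[0.0] * len(feature_cols) for _ in nodes] for _ in times]
    for r in rows:
        # float() raises ValueError on non-numeric entries
        values = [float(r[c]) for c in feature_cols]
        x[t_idx[r[time_col]]][n_idx[r[node_id_col]]] = values
    return x, times, nodes


x, times, nodes = pivot_features(
    FEATURE_CSV, node_id_col="region", time_col="date",
    feature_cols=["cases", "deaths"],
)
print(len(x), len(x[0]), len(x[0][0]))  # 3 2 2, i.e. T=3, N=2, F=2
print(x[1][1])                          # [22.0, 3.0]
```

With the same table saved to disk, the documented call would take the form UniversalDataset.from_csv(feature_csv=path, node_id_col="region", time_col="date", feature_cols=["cases", "deaths"]), where path is a placeholder for your own file.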
- generate_dataset(X=None, Y=None, states=None, dynamic_adj=None, lookback_window_size=1, horizon_size=1, ahead=0, permute=False)
Takes node features for the graph and divides them into multiple samples along the time axis by sliding a window of size (num_timesteps_input + num_timesteps_output) across it in steps of 1.
- Parameters:
X – Node features of shape (num_vertices, num_features, num_timesteps).
- Returns:
Node features divided into multiple samples, of shape (num_samples, num_vertices, num_features, num_timesteps_input).
Node targets for the samples, of shape (num_samples, num_vertices, num_features, num_timesteps_output).
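The sliding-window split can be sketched on time indices alone. This is a simplified illustration rather than the library's implementation: the helper sliding_windows is hypothetical, it assumes that lookback_window_size and horizon_size play the roles of num_timesteps_input and num_timesteps_output, and it does not model the ahead or permute arguments.

```python
def sliding_windows(num_timesteps, lookback_window_size=1, horizon_size=1):
    """Return (input_steps, target_steps) index pairs, one per sample."""
    window = lookback_window_size + horizon_size
    samples = []
    for start in range(num_timesteps - window + 1):  # slide in steps of 1
        split = start + lookback_window_size
        inputs = list(range(start, split))             # lookback part
        targets = list(range(split, start + window))   # horizon part
        samples.append((inputs, targets))
    return samples


samples = sliding_windows(num_timesteps=6, lookback_window_size=3, horizon_size=2)
for inputs, targets in samples:
    print(inputs, "->", targets)
# [0, 1, 2] -> [3, 4]
# [1, 2, 3] -> [4, 5]
print(len(samples))  # 2, i.e. 6 - (3 + 2) + 1
```

Under these assumptions, a series of T timesteps yields T - (lookback_window_size + horizon_size) + 1 samples, so short series with long windows produce few or no samples.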
- classmethod load_from_csv(feature_csv: str, node_id_col: str, time_col: str, feature_cols: list, target_cols: list | None = None, edge_csv: str | None = None, source_col: str = 'source', target_col: str = 'target', strict_numeric: bool = True)
Load dataset from CSV files and build a UniversalDataset without changing existing behaviors.
- Parameters:
feature_csv – Path to the CSV containing time series features/targets. Must include time_col, node_id_col, feature_cols, and optionally target_cols.
node_id_col – Column name for node identifiers.
time_col – Column name for timestamps (sorted ascending).
feature_cols – List of feature column names (numeric).
target_cols – Optional list of target column names (numeric).
edge_csv – Optional path to an edges CSV with two columns: source_col, target_col.
source_col – Name of the source-node column in the edges CSV.
target_col – Name of the target-node column in the edges CSV.
strict_numeric – If True, raise on any non-numeric entries in features/targets. If False, warn and keep NaNs.
- Returns:
UniversalDataset(x=[T,N,F], y=[T,N] or [T,N,Ty], graph=[N,N], edge_index=[2,E])
Preprocessed Datasets
We collect epidemic data from various sources, including the following:
Temporal Data
Tycho_v1.0.0: Covers eight diseases, with data collected across 50 US states and 122 US cities from 1916 to 2009.
Measles: Contains measles infections in England and Wales across 954 urban centers (cities and towns) from 1944 to 1964.
Spatial&Temporal Data
Dataset Loading
Loading the Measles and Tycho datasets:
from epilearn.data import UniversalDataset
tycho_dataset = UniversalDataset(name='Tycho_v1', root='./tmp/')
measle_dataset = UniversalDataset(name='Measles', root='./tmp/')
For COVID data, we support the dataset from Johns Hopkins University:
from epilearn.data import UniversalDataset
jhu_dataset = UniversalDataset(name='JHU_covid', root='./tmp/')
For other countries, use 'Covid_' + country to obtain the corresponding COVID dataset. Currently supported countries include China, Brazil, Austria, England, France, Italy, Newzealand, and Spain.
from epilearn.data import UniversalDataset
covid_dataset = UniversalDataset(name='Covid_Brazil', root='./tmp/')
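The naming rule can be spelled out as a short sketch. Only 'Covid_Brazil' is confirmed by the example above; the other names are an assumption that each country is used as a suffix exactly as it is spelled in the list.

```python
# Country suffixes copied as spelled in the list above (assumed to be exact).
countries = ["China", "Brazil", "Austria", "England",
             "France", "Italy", "Newzealand", "Spain"]

# Build the dataset names following the 'Covid_' + country rule.
dataset_names = ["Covid_" + country for country in countries]
print(dataset_names[1])  # Covid_Brazil
```

Each resulting string would then be passed as the name argument, as in UniversalDataset(name='Covid_Brazil', root='./tmp/').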
Customize Your Own Dataset
First, organize your data as a dictionary with the keys features, graph, dynamic_graph, targets, and states. Here is an example:
import torch

data = torch.load("example.pt")
print(data.keys())
# dict_keys(['features', 'graph', 'dynamic_graph', 'targets', 'states'])
node_features = data['features'] # [time steps, nodes, channels]: torch.Size([539, 47, 4])
static_graph = torch.Tensor(data['graph']) # [nodes, nodes]: (47, 47)
dynamic_graph = data['dynamic_graph'] # [time steps, nodes, nodes]: torch.Size([539, 47, 47])
targets = data['targets'] # [time steps, nodes]: torch.Size([539, 47])
node_status = data['states'] # [time steps, nodes]: torch.Size([539, 47])
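Before building a dataset, it can help to confirm that these tensors agree on the number of time steps and nodes. The following is a minimal sketch of such a check on plain shape tuples; check_shapes is a hypothetical helper, not part of EpiLearn, and the rules it enforces are taken from the shape comments in the example above.

```python
def check_shapes(features, graph, dynamic_graph, targets, states):
    """Return a list of problems found among the given shape tuples."""
    num_steps, num_nodes, _ = features  # [time steps, nodes, channels]
    problems = []
    if graph != (num_nodes, num_nodes):
        problems.append(f"graph should be {(num_nodes, num_nodes)}, got {graph}")
    if dynamic_graph != (num_steps, num_nodes, num_nodes):
        problems.append(f"dynamic_graph has unexpected shape {dynamic_graph}")
    # targets and states may carry extra trailing dimensions
    if targets[:2] != (num_steps, num_nodes):
        problems.append(f"targets has unexpected shape {targets}")
    if states[:2] != (num_steps, num_nodes):
        problems.append(f"states has unexpected shape {states}")
    return problems


problems = check_shapes(
    features=(539, 47, 4),
    graph=(47, 47),
    dynamic_graph=(539, 47, 47),
    targets=(539, 47),
    states=(539, 47),
)
print(problems)  # []
```

With real tensors, each argument would be passed as tuple(tensor.shape), for example features=tuple(node_features.shape).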
Next, use your own data to create a UniversalDataset by passing the corresponding parameters according to your needs. Not every parameter is required. Refer to the UniversalDataset reference above for detailed parameter descriptions.
from epilearn.data import UniversalDataset
dataset_sample1 = UniversalDataset(
    x=node_features,
    states=node_status,  # additional per-node information, e.g. SIR states
    y=targets,  # prediction target
    graph=static_graph,  # static adjacency matrix; an edge index is also supported: edge_index = ...
    dynamic_graph=dynamic_graph,  # one adjacency matrix per time step
)
# Only a subset of the parameters is needed, e.g. features, targets, and a static graph
dataset_sample2 = UniversalDataset(x=node_features, y=targets, graph=static_graph)
For more sample code in a real training process, refer to examples/dataset_customization.ipynb in the GitHub repository.