Datasets

Plain Graphs

Name	#nodes	#edges	#labels	Type	URL
Youtube	1,138,499	2,990,443	47	undirected	[raw] [preprocessed]
TWeibo	2,320,895	50,655,143	100	directed	[raw] [preprocessed]
Orkut	3,072,441	117,185,084	100	undirected	[raw] [preprocessed]
In-2004	1,382,908	16,539,643	-	directed	[raw] [preprocessed]
DBLP	5,425,963	17,298,032	-	undirected	[raw] [preprocessed]
Pokec	1,632,803	30,622,564	-	directed	[raw] [preprocessed]
LiveJournal	4,847,571	68,475,391	-	directed	[raw] [preprocessed]
IT-2004	41,291,594	1,135,718,909	-	directed	[raw] [preprocessed]
Twitter	41,652,230	1,468,365,182	-	directed	[raw] [preprocessed]
Friendster	65,608,366	1,806,067,135	-	undirected	[raw] [preprocessed]
UK-2007	105,896,555	3,738,733,648	-	directed	[raw] [preprocessed]
UK-union	133,633,040	5,475,109,924	-	directed	[raw] [preprocessed]
ClueWeb12	978,408,098	42,574,107,469	-	directed	[raw]
ClueWeb09	1,684,868,322	7,939,635,651	-	directed	[raw] [preprocessed]

Please cite our papers if you publish results based on our preprocessed datasets.

@article{yang13homogeneous,
  title={Homogeneous Network Embedding for Massive Graphs via Reweighted Personalized PageRank},
  author={Yang, Renchi and Shi, Jieming and Xiao, Xiaokui and Yang, Yin and Bhowmick, Sourav S},
  journal={Proceedings of the VLDB Endowment},
  volume={13},
  number={5}
}

@article{shi13realtime,
  title={Realtime Index-Free Single Source SimRank Processing on Web-Scale Graphs},
  author={Shi, Jieming and Jin, Tianyuan and Yang, Renchi and Xiao, Xiaokui and Yang, Yin},
  journal={Proceedings of the VLDB Endowment},
  volume={13},
  number={7}
}

Attributed Graphs

Name	Type	#nodes	#edges	#attributes	#labels	URL
Wiki	directed	2405	17981	4973	19	[raw] [preprocessed]
Cora	directed	2708	5429	1433	7	[raw] [preprocessed]
Citeseer	directed	3312	4660	3703	6	[raw] [preprocessed]
Pubmed	directed	19717	44338	500	3	[raw] [preprocessed]
BlogCatalog	undirected	5196	343486	8189	6	[raw] [preprocessed]
PPI	undirected	56944	818716	50	121	[raw] [preprocessed]
Reddit	undirected	232965	11606919	300	41	[raw] [preprocessed]
Flickr	undirected	7575	479476	12047	9	[raw] [preprocessed]
Facebook	undirected	4039	88234	1283	193	[raw] [preprocessed]
Twitter	directed	81306	1768149	216839	4065	[raw] [preprocessed]
Google+	directed	107614	13673453	15907	468	[raw] [preprocessed]
TWeibo	directed	2320895	50655143	1657	8	[raw] [preprocessed]
MAG	directed	59249719	978147253	2000	100	[raw] [preprocessed]
MAG-SC	directed	10541560	265219994	2784240	8	[raw] [preprocessed]

Tip: node attributes in our preprocessed datasets are stored either as an “attrs.pkl” file, serialized with the cPickle package in Python 2.7, or as an “attrs.npz” file. Either file can be loaded as a sparse attribute matrix with the corresponding snippet below.

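# Python 2.7: load attrs.pkl (open in binary mode and close the file afterwards)
import cPickle as pickle
with open("attrs.pkl", "rb") as f:
    features = pickle.load(f)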

# Load attrs.npz as a SciPy sparse matrix
from scipy import sparse
features = sparse.load_npz("attrs.npz")
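
The cPickle snippet above targets Python 2.7. For readers on Python 3, the following is a minimal sketch of one way to load either file format. The helper name load_attrs is hypothetical and not part of the released datasets, and the encoding="latin1" argument is an assumption based on the usual way of reading Python 2 pickles that contain NumPy or SciPy objects. Depending on the installed SciPy version, unpickling an old “attrs.pkl” may still fail, in which case the “attrs.npz” file is the safer choice.

```python
import pickle

from scipy import sparse


def load_attrs(path):
    """Load node attributes from an 'attrs.npz' or 'attrs.pkl' file.

    Hypothetical helper for illustration; not shipped with the datasets.
    """
    if path.endswith(".npz"):
        # Files written by scipy.sparse.save_npz
        return sparse.load_npz(path)
    with open(path, "rb") as f:
        # Assumption: latin1 decoding lets Python 3 read byte strings
        # that were pickled under Python 2.7
        return pickle.load(f, encoding="latin1")
```

For example, features = load_attrs("attrs.npz") returns the attribute matrix, and features.shape can then be compared against the #nodes and #attributes columns in the table above.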

Please cite our paper if you publish results based on our preprocessed datasets.

@article{yang2020scaling,
  title={Scaling Attributed Network Embedding to Massive Graphs},
  author={Yang, Renchi and Shi, Jieming and Xiao, Xiaokui and Yang, Yin and Liu, Juncheng and Bhowmick, Sourav S},
  journal={Proceedings of the VLDB Endowment},
  volume={14},
  number={1},
  pages={37--49},
  year={2021},
  publisher={VLDB Endowment}
}