A simple example using the bvgraph class
This sample will demonstrate the bvgraph class. It will show a few features and limitations of the class for working with these large graph files in Matlab.
Contents
bvgraph background
The BV graph compression scheme and the *.graph/BVGraph file type are data structures and techniques created to work with extremely large web graph matrices. Many researchers observed that these matrices have a self-similar structure and exhibit considerable index locality. Both of these features are exploited by the BV graph compression scheme to store these large webgraph matrices at only a few bits per edge.
Paolo Boldi and Sebastiano Vigna wrote the initial BVGraph implementation in Java. While Matlab has nice features to work with Java, these features do not scale to working with extremely large vectors. Consequently, this Matlab package re-implements the procedures necessary to read a BVGraph file in C. Further, it provides routines to operate on these files and makes them look like a Matlab matrix. The goal is to be as efficient as possible and support computing the PageRank vector for extremely large graphs in Matlab.
Load a graph
Loading a graph with the bvgraph class is simple.
G = bvgraph('../data/wb-cs.stanford')
G = bvgraph: 9914-by-9914
There is a slight caveat about loading. The file with most of the data for the bvgraph file is ../data/wb-cs.stanford.graph. However, you must load the file without the .graph extension. For example, you'll get a slightly helpful error if you accidentally forget this. (This may be changed in a future version.)
G2 = bvgraph('../data/wb-cs.stanford.graph');
Error using ==> bvgraph.bvgraph at 67 The file ../data/wb-cs.stanford.graph.graph does not seem to exist!
Loading offline or online
There are two ways of loading a bvgraph file: streaming mode (offline) or inmemory (online). The difference is that when a file is loaded in streaming mode, the class does not actually load the .graph file into memory, but iterates over the file on the disk for every operation. This is slower, but hardly uses any memory on the computer. For example,
G = bvgraph('../data/wb-cs.stanford'); Goff = bvgraph('../data/wb-cs.stanford',struct('load_type','offline')); whos
Name Size Bytes Class Attributes G 9914x9914 27263 bvgraph Goff 9914x9914 2296 bvgraph
Here we find that Goff takes only 3k whereas G takes 28k of memory. In this case, it seems silly to avoid loading 25k of memory to get the offline operations. However, for larger graphs, the difference is much more pronounced. Goff always takes about 3k, where G might takes hundreds of megabytes.
Some simple operations
The bvgraph class in Matlab supports the following simple operations
size_G = size(G) nzcount = nnz(G) d = diag(G); s1 = sum(G); s2 = sum(G,2);
size_G = 9914 9914 nzcount = 36854
Let's use an operation to get the size of the matrix.
n = size(G,1);
Converting to a native Matlab matrix
If the bvgraph file is small enough to fit into Matlab's memory as a Matlab sparse matrix, the sparse call will convert it.
A = sparse(G);
Transposes
The bvgraph class implements an implicit transform. That is, taking the transpose of a matrix does nothing to the underlying data and only changes the interpretation of the data in the Matlab interface.
Gt = G'; isequal(sum(Gt,1)',sum(G,2))
ans = 1
Because only the interpretation of the data is transposed, the istrans function reveals this state
istrans(G) istrans(G')
ans = 0 ans = 1
PageRank opertions
The main goal with implementing the BVGraph format inside of Matlab is to support computing PageRank on really large graphs. The key computation in PageRank is the sub-stochastic matrix vector multiplication
where x is the current iterate, D is the diagonal matrix of degrees, and
is the substochastic matrix corresponding to a random walk on the graph for adjacency matrix A. The bvgraph class supports a specialized method to compute this product without storing a vector for the diagonal of D. However, the substochastic_mult function is a little quirky.
% create the substochastic matrix corresponding to a random walk on A [i j v] = find(A); d = sum(A,2); P = sparse(i, j, v(i)./d(i), size(A,1), size(A,2)); % now P = D^+ A; x = ones(n,1); y_G = substochastic_mult(G,x); y_P = P*x; norm(y_G-y_P,inf)
ans = 0
When applied to G', the substochastic_mult function computes the transpose of the sub-stochastic product.
y_Gt = substochastic_mult(G',x); y_Pt = P'*x; norm(y_Gt-y_Pt,inf)
ans = 0
Since one goal is PageRank, the bvgraph class has a specific PageRank method.
help bvpagerank
x = bvpagerank(G);
--- help for bvgraph/bvpagerank.m --- BVGRAPH/BVPAGERANK A simple implementation of PageRank for a bvgraph object. x = bvpagerank(G) computes an approximate PageRank vector x such that ||alpha*(P+d*v')*x + (1-alpha)*(e*v')*x - x||_1 < tol where alpha = 0.85, v = 1/n for all vertices, tol = 1e-8, and maxiter = 1000. x = bvpagerank(G,alpha,v,tol,maxiter) specifies all of the options of the PageRank system. The type of PageRank system is a strongly personalized PageRank system. To use the default value of an option, specify []. By default, the pagerank function computes the PageRank operation for the matrix as stored _on disk_. For example, bvpagerank(G) and bvpagerank(G') will return the same vector. We use this behavior because computing bvpagerank(G') takes additional memory with respect to pagerank(G). Since the goal of the bvgraph library is to support large-scale PageRank, we chose the memory-minimizing interpretation. This implementation of PageRank will compute the vector for G', however, you must set a final parameter to 0 and call x = pagerank(G',[],[],[],[],0); The algorithm used is the Power method for the PageRank algorithm. Example: G = bvgraph('data/wb-cs.stanford'); x = bvpagerank(G); x = bvpagerank(G,0.9,[],1e-5,10); y = bvpagerank(G',[],[],[],[],0); % turn off native optimizations iter 1 0.588174 0.01 iter 2 0.286471 0.01 iter 3 0.153072 0.01 iter 4 0.0889456 0.01 iter 5 0.0542172 0.01 iter 6 0.0363644 0.01 iter 7 0.0252051 0.01 iter 8 0.0184926 0.01 iter 9 0.0139464 0.01 iter 10 0.0107167 0.01 iter 11 0.00839594 0.01 iter 12 0.00659468 0.01 iter 13 0.00522328 0.01 iter 14 0.00414234 0.01 iter 15 0.00330867 0.01 iter 16 0.00262997 0.01 iter 17 0.00211307 0.01 iter 18 0.00168965 0.01 iter 19 0.00135813 0.01 iter 20 0.0010915 0.01 iter 21 0.000880287 0.01 iter 22 0.000706975 0.01 iter 23 0.000572634 0.01 iter 24 0.000461322 0.01 iter 25 0.000373298 0.01 iter 26 0.000302176 0.01 iter 27 0.000245035 0.01 iter 28 0.000198232 0.01 iter 29 0.000161401 0.01 iter 30 0.000130781 0.01 iter 31 0.000106367 0.01 iter 32 8.65371e-005 0.01 iter 33 7.04309e-005 0.01 iter 34 5.72682e-005 0.01 iter 35 4.67709e-005 0.01 iter 36 3.80577e-005 0.01 iter 37 3.10717e-005 0.01 iter 38 2.53809e-005 0.01 iter 39 2.07268e-005 0.01 iter 40 1.69279e-005 0.01 iter 41 1.38672e-005 0.01 iter 42 1.13277e-005 0.01 iter 43 9.2783e-006 0.01 iter 44 7.60369e-006 0.01 iter 45 6.22754e-006 0.01 iter 46 5.10692e-006 0.01 iter 47 4.19651e-006 0.01 iter 48 3.44086e-006 0.01 iter 49 2.82904e-006 0.01 iter 50 2.32741e-006 0.01 iter 51 1.91483e-006 0.01 iter 52 1.57779e-006 0.01 iter 53 1.30177e-006 0.01 iter 54 1.07226e-006 0.01 iter 55 8.85194e-007 0.01 iter 56 7.30781e-007 0.01 iter 57 6.03263e-007 0.01 iter 58 4.98316e-007 0.01 iter 59 4.12097e-007 0.01 iter 60 3.40546e-007 0.01 iter 61 2.81734e-007 0.01 iter 62 2.33177e-007 0.01 iter 63 1.93033e-007 0.01 iter 64 1.59868e-007 0.01 iter 65 1.32535e-007 0.01 iter 66 1.09897e-007 0.01 iter 67 9.11464e-008 0.01 iter 68 7.56805e-008 0.01 iter 69 6.28511e-008 0.01 iter 70 5.22238e-008 0.01 iter 71 4.34037e-008 0.01 iter 72 3.60912e-008 0.01 iter 73 3.00383e-008 0.01 iter 74 2.49936e-008 0.01 iter 75 2.08175e-008 0.01 iter 76 1.73507e-008 0.01 iter 77 1.4479e-008 0.01 iter 78 1.20793e-008 0.01 iter 79 1.00859e-008 0.01
As noted in the documentation, this method works on the data as on disk and removes any transpose operations performed on the matrix.
y = bvpagerank(G'); norm(x-y,inf)
iter 1 0.588174 0.01 iter 2 0.286471 0.01 iter 3 0.153072 0.01 iter 4 0.0889456 0.01 iter 5 0.0542172 0.01 iter 6 0.0363644 0.01 iter 7 0.0252051 0.01 iter 8 0.0184926 0.01 iter 9 0.0139464 0.01 iter 10 0.0107167 0.01 iter 11 0.00839594 0.01 iter 12 0.00659468 0.01 iter 13 0.00522328 0.01 iter 14 0.00414234 0.01 iter 15 0.00330867 0.01 iter 16 0.00262997 0.01 iter 17 0.00211307 0.01 iter 18 0.00168965 0.01 iter 19 0.00135813 0.01 iter 20 0.0010915 0.01 iter 21 0.000880287 0.01 iter 22 0.000706975 0.01 iter 23 0.000572634 0.01 iter 24 0.000461322 0.01 iter 25 0.000373298 0.01 iter 26 0.000302176 0.01 iter 27 0.000245035 0.01 iter 28 0.000198232 0.01 iter 29 0.000161401 0.01 iter 30 0.000130781 0.01 iter 31 0.000106367 0.01 iter 32 8.65371e-005 0.01 iter 33 7.04309e-005 0.01 iter 34 5.72682e-005 0.01 iter 35 4.67709e-005 0.01 iter 36 3.80577e-005 0.01 iter 37 3.10717e-005 0.01 iter 38 2.53809e-005 0.01 iter 39 2.07268e-005 0.01 iter 40 1.69279e-005 0.01 iter 41 1.38672e-005 0.01 iter 42 1.13277e-005 0.01 iter 43 9.2783e-006 0.01 iter 44 7.60369e-006 0.01 iter 45 6.22754e-006 0.01 iter 46 5.10692e-006 0.01 iter 47 4.19651e-006 0.01 iter 48 3.44086e-006 0.01 iter 49 2.82904e-006 0.01 iter 50 2.32741e-006 0.01 iter 51 1.91483e-006 0.01 iter 52 1.57779e-006 0.01 iter 53 1.30177e-006 0.01 iter 54 1.07226e-006 0.01 iter 55 8.85194e-007 0.01 iter 56 7.30781e-007 0.01 iter 57 6.03263e-007 0.01 iter 58 4.98316e-007 0.01 iter 59 4.12097e-007 0.01 iter 60 3.40546e-007 0.01 iter 61 2.81734e-007 0.01 iter 62 2.33177e-007 0.01 iter 63 1.93033e-007 0.01 iter 64 1.59868e-007 0.01 iter 65 1.32535e-007 0.01 iter 66 1.09897e-007 0.01 iter 67 9.11464e-008 0.01 iter 68 7.56805e-008 0.01 iter 69 6.28511e-008 0.01 iter 70 5.22238e-008 0.01 iter 71 4.34037e-008 0.01 iter 72 3.60912e-008 0.01 iter 73 3.00383e-008 0.01 iter 74 2.49936e-008 0.01 iter 75 2.08175e-008 0.01 iter 76 1.73507e-008 0.01 iter 77 1.4479e-008 0.01 iter 78 1.20793e-008 0.01 iter 79 1.00859e-008 0.01 ans = 0
To call PageRank on the tranpose (which takes additional memory, see
What isn't implemented (yet)
Currently, we do not support indexed accesses to the bvgraph type, so the following feature will fail
G(1,2)
Error using ==> evalin Index exceeds matrix dimensions.
Additionally, the multiplication operation is only supported for a double type.
x = true(size(G,1),1); G*x
the vector x is not a double
Finally, there is no way to output a Matlab matrix as a bvgraph file.