Using *HUGE* Graphs in Matlab
This example will demonstrate running PageRank on a 110M node graph with 1B edges. The system is a 64-bit Dell Linux machine with 2GB of RAM. You must have 2GB of RAM to run this example. Also, you must download the webbase-2001.graph and webbase-2001.properties files from: http://law.dsi.unimi.it/index.php?option=com_include&Itemid=65 Note that this example will not work on a 32-bit version of Matlab running on Windows, even on a machine with 2 GB of RAM. I haven't tested it on a 32-bit version of Linux.
Contents
Load the graph
The first step is to load the graph. The .graph file is about 300 MB, so we are going to stream in from disk, or load it in 'offline' mode.
Set the following path to the location of the webbase-2001.graph file on your machine
if ~exist('webbase_path','var') webbase_path = []; end G = bvgraph([webbase_path 'webbase-2001'],struct('load_type','offline'));
Let's check some simple properties of the graph.
n = size(G,1); nz = nnz(G); s = sum(G,2); % compute the sum to get dangling nodes s = s == 0; d = diag(G); % look at the diagonal to count self-loops fprintf('The graph has\n'); fprintf(' %10i vertices,\n', n); fprintf(' %10i edges,\n', nz); fprintf(' %10i dangling nodes, and\n', sum(s)); fprintf(' %10i self-loops.\n',sum(d));
The graph has 118142155 vertices, 1019903190 edges, 27659673 dangling nodes, and 27058299 self-loops.
Let's see how much memory those operations took.
whos
Name Size Bytes Class Attributes G 118142155x118142155 2844 bvgraph d 118142155x1 945137240 double n 1x1 8 double nz 1x1 8 double s 118142155x1 118142155 logical webbase_path 1x38 76 char
At the moment, we are using quite a bit of memory! When working with these large graph files, we need to be very careful not to exceed the amount of memory on the system. If we do, then Matlab will start swapping or writing memory to disk, which will cause the program to run unreasonably slow.
Check native memory use
Just for fun, let's see how much memory it would require to load G as a Matlab matrix. On a 32-bit platform, a Matlab sparse matrix requires 4 bytes per row, and 5 bytes per non-zero. On a 64-bit platform, it requires 8 bytes per row, and 9 bytes per non-zero.
matlab_32_mem = 4*(size(G,1)+1) + 5*nnz(G); matlab_64_mem = 8*(size(G,1)+1) + 9*nnz(G); fprintf('For an in-memory matrix, Matlab would need\n'); fprintf(' %10i bytes on a 32-bit machine\n', matlab_32_mem); fprintf(' %10i bytes on a 64-bit machine\n', matlab_64_mem);
For an in-memory matrix, Matlab would need 5.572085e+09 bytes on a 32-bit machine 1.012427e+10 bytes on a 64-bit machine
Wow, that's a lot of memory!
Check PageRank memory use
The PageRank computation using the Power Method requires 2 vectors of storage. Let's make sure we can fit everything into memory. A matlab vector takes 8 bytes per row.
pagerank_required_mem = (8*size(G,1))*2; fprintf('The bvpagerank operation requires\n'); fprintf(' %10i bytes of memory\n', pagerank_required_mem);
The bvpagerank operation requires 1890274480 bytes of memory
The operation takes about 1.9 GB of memory. This requirement is why this example requires a machine with 2 GB of memory. Also, we could not run this example if we had loaded the graph into memory, because the graph file takes 300 MB of memory itself. If you have a machine with more than 2 GB of memory, check and see how much faster the code will run with the matrix in memory.
Computing PageRank
With all of that preamble gone, all that is left to do is call the bvpagerank function. Before we do this, however, we first clear all the existing vectors.
clear s d; % WARNING: This takes about 2 hours! x = bvpagerank(G);
iter 1 0.813891 131.32 iter 2 0.349311 94.98 iter 3 0.172124 92.60 iter 4 0.0976978 87.09 iter 5 0.0618677 92.48 iter 6 0.0423836 89.27 iter 7 0.0305012 84.63 iter 8 0.0228049 96.87 iter 9 0.0174295 91.06 iter 10 0.0135588 89.09 iter 11 0.0106725 103.18 iter 12 0.0084911 83.34 iter 13 0.00679389 93.81 iter 14 0.00547703 83.75 iter 15 0.00443213 102.63 iter 16 0.00360332 82.17 iter 17 0.00293702 90.16 iter 18 0.00240339 78.35 iter 19 0.00196934 94.76 iter 20 0.00161912 90.58 iter 21 0.00133268 94.20 iter 22 0.00109956 82.55 iter 23 0.000908065 98.65 iter 24 0.000751579 81.59 iter 25 0.0006223 94.85 iter 26 0.000516324 84.45 iter 27 0.000428583 95.68 iter 28 0.000356288 80.29 iter 29 0.000296286 110.86 iter 30 0.000246791 81.43 iter 31 0.000205566 98.82 iter 32 0.000171479 77.64 iter 33 0.000143059 101.90 iter 34 0.000119484 86.15 iter 35 9.98142e-05 94.79 iter 36 8.34876e-05 84.02 iter 37 6.98221e-05 87.09 iter 38 5.8461e-05 104.60 iter 39 4.89442e-05 81.75 iter 40 4.10159e-05 100.34 iter 41 3.43687e-05 81.69 iter 42 2.88273e-05 101.69 iter 43 2.41743e-05 89.83 iter 44 2.02897e-05 96.05 iter 45 1.70276e-05 82.74 iter 46 1.43012e-05 91.95 iter 47 1.20094e-05 80.62 iter 48 1.0093e-05 84.76 iter 49 8.48006e-06 102.56 iter 50 7.13064e-06 79.20 iter 51 5.99499e-06 102.01 iter 52 5.04344e-06 86.08 iter 53 4.2421e-06 109.53 iter 54 3.57048e-06 109.66 iter 55 3.00453e-06 83.57 iter 56 2.52994e-06 106.36 iter 57 2.12983e-06 108.93 iter 58 1.79406e-06 87.52 iter 59 1.51088e-06 91.49 iter 60 1.27318e-06 88.94 iter 61 1.0726e-06 88.89 iter 62 9.04162e-07 85.97 iter 63 7.62014e-07 99.21 iter 64 6.42519e-07 113.12 iter 65 5.4166e-07 87.23 iter 66 4.56883e-07 102.73 iter 67 3.85279e-07 194.65 iter 68 3.25071e-07 99.62 iter 69 2.74198e-07 103.29 iter 70 2.31403e-07 97.07 iter 71 1.95248e-07 84.93 iter 72 1.64815e-07 98.38 iter 73 1.39091e-07 107.02 iter 74 1.17432e-07 89.96 iter 75 9.91412e-08 107.15 iter 76 8.37159e-08 94.17 iter 77 7.06857e-08 87.45 iter 78 5.97086e-08 148.86 iter 79 5.04163e-08 106.69 iter 80 4.25972e-08 83.06 iter 81 3.59779e-08 90.66 iter 82 3.04075e-08 78.23 iter 83 2.56799e-08 83.60 iter 84 2.17101e-08 82.21 iter 85 1.83416e-08 80.93 iter 86 1.55015e-08 82.54 iter 87 1.31009e-08 77.19 iter 88 1.10774e-08 86.33
A few notes
The bvpagerank code carefully tries to avoid operations that create temporary vectors in Matlab. Towards this end, we write strange sequences of operations like
% x = x - y % delta = norm(x,1)
instead of the more natural Matlab syntax
% delta = norm(x-y,1)
because the latter syntax creates another vector for the difference x-y that is just temporary.