# Pagerank on subgraphs: efficient Monte-Carlo estimation

This repo contains the reference code for Subrank, my novel algorithm for efficiently computing the Pagerank distribution over a subgraph $S$ of a graph $G$.
For the reasoning behind the algorithm, its definition, and its analysis, I invite the interested reader to [read the paper](https://pippellia.com/pippellia/Social+Graph/Pagerank+on+subgraphs%E2%80%94efficient+Monte-Carlo+estimation).

To play with it, follow these steps:

## Step 0: Build and store the Graph

In [1]:
# Imports
from nostr_dvm.utils.wot_utils import build_wot_network, save_network, load_network, get_mc_pagerank, get_subrank, get_metadata, print_results
import time
import networkx as nx
import random




In [9]:
user = '99bb5591c9116600f845107d31f9b59e2f7c7e09a1ff802e84f1d43da557ca64'
show_results_num = 20
use_files = False
fetch_metadata = True

In [6]:
index_map, G = await build_wot_network(user, depth=2, max_batch=500, max_time_request=10)
if use_files:
    save_network(index_map, G, user)

Step 1: fetching kind 3 events from relays & pre-processing
current network: 44014 npubs
Finished in 58.22388768196106
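
To give an idea of what "fetching kind 3 events & pre-processing" involves, here is a minimal sketch of turning Nostr kind 3 (follow list) events into a directed follow graph. It is not the code of `build_wot_network`; the helper name `follow_edges`, the shortened pubkeys, and the sample events are made up for illustration. It only assumes the standard event layout, where each followed key appears in a `["p", <pubkey>, ...]` tag and the newest kind 3 event of an author replaces the older ones.

```python
def follow_edges(events):
    """Map each author to the list of pubkeys they follow (newest kind 3 event wins)."""
    latest = {}
    for ev in events:
        if ev.get("kind") != 3:
            continue  # only follow lists are relevant here
        prev = latest.get(ev["pubkey"])
        if prev is None or ev["created_at"] > prev["created_at"]:
            latest[ev["pubkey"]] = ev

    graph = {}
    for author, ev in latest.items():
        follows = []
        for tag in ev.get("tags", []):
            # a follow is a tag of the form ["p", <pubkey>, ...]
            if len(tag) >= 2 and tag[0] == "p" and tag[1] not in follows:
                follows.append(tag[1])
        graph[author] = follows
    return graph


# made-up events with shortened, hypothetical pubkeys
events = [
    {"kind": 3, "pubkey": "alice", "created_at": 100, "tags": [["p", "bob"]]},
    {"kind": 3, "pubkey": "alice", "created_at": 200,
     "tags": [["p", "bob"], ["p", "carol", "wss://relay.example"]]},
    {"kind": 1, "pubkey": "bob", "created_at": 150, "tags": []},
    {"kind": 3, "pubkey": "bob", "created_at": 120, "tags": [["p", "alice"]]},
]
graph = follow_edges(events)
```

Each `author -> followed` pair then becomes one directed edge of $G$.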


## Step 1: Load the graph database

If `use_files` is enabled, load the stored networkx graph back into memory by running the following code. Otherwise the graph built in Step 0 is already in memory and this cell does nothing.

In [7]:
if use_files:
    # loading the database
    print('loading the database...')
    tic = time.time()
    
    index_map, G = load_network(user)
    
    toc = time.time()
    print(f'finished in {toc-tic} seconds')

## Step 2: Compute Pagerank over $G$

Compute the Pagerank over $G$ with the networkx built-in `pagerank` function, which uses the power iteration method.
This vector is treated as the true Pagerank vector and is used to measure the error of the Monte-Carlo algorithm.

In [10]:
# computing the pagerank
print('computing global pagerank...')
tic = time.time()

p_G = nx.pagerank(G, tol=1e-12)
    
await print_results(p_G, index_map, show_results_num, getmetadata=fetch_metadata)
    
toc = time.time()
print(f'finished in {toc-tic} seconds')

computing global pagerank...
Don't ₿elieve the Hype 🦊(npub1nxa4tywfz9nqp7z9zp7nr7d4nchhclsf58lcqt5y782rmf2hefjquaa6q8) 8.163684335574227e-05
The: Daniel⚡️(npub1aeh2zw4elewy5682lxc6xnlqzjnxksq303gwu2npfaxd49vmde6qcq4nwx) 5.7772373353072425e-05
zach(npub10fu0hlkx3s4n4dsgfu0cpqephga4afr4qtzpz9vsyqf7vj88v2yqdp8vp4) 4.225869500489523e-05
elsat(npub1zafcms4xya5ap9zr7xxr0jlrtrattwlesytn2s42030lzu0dwlzqpd26k5) 4.740428451060313e-05
opreturnbot(npub1e30jt8crv6phnrj22gr3mwuhywrs9lak7ry94akjw0ydm0juptas5xmkwq) 2.8600984950627014e-05
Seth For Privacy(npub1tr4dstaptd2sp98h7hlysp8qle6mw7wmauhfkgz3rmxdd8ndprusnw2y5g) 4.345057953406442e-05
JeffG(npub1zuuajd7u3sx8xu92yav9jwxpr839cs0kc3q6t56vd5u9q033xmhsk6c2uc) 6.123308353585933e-05
CARLA(npub1hu3hdctm5nkzd8gslnyedfr5ddz3z547jqcl5j88g4fame2jd08qh6h8nh) 7.928631886973004e-05
Derek Ross(npub18ams6ewn5aj2n3wt2qawzglx9mr4nzksxhvrdc4gzrecw7n5tvjqctp424) 9.094182894540007e-05
DickWhitman(npub102a0auqvye3eayugvfwy44un9l477t45uck8s2p08xzpgh784uvslsh7w9
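
For readers who have not seen the power iteration method, here is a minimal pure-Python sketch of it on a toy graph. It is not the networkx implementation: the function name, the toy graph, and the stopping rule (total change below `tol`) are assumptions made for illustration, and dangling nodes are handled by spreading their mass uniformly.

```python
def pagerank_power_iteration(out_edges, alpha=0.85, tol=1e-12, max_iter=1000):
    """Return {node: rank} for a directed graph given as {node: [successors]}."""
    nodes = list(out_edges)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # mass sitting on nodes without out-edges is spread uniformly
        dangling = sum(rank[v] for v in nodes if not out_edges[v])
        new_rank = {v: (1.0 - alpha) / n + alpha * dangling / n for v in nodes}
        for v in nodes:
            if out_edges[v]:
                share = alpha * rank[v] / len(out_edges[v])
                for w in out_edges[v]:
                    new_rank[w] += share
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# toy graph: every successor must also appear as a key
toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank_power_iteration(toy)
```

On this toy graph `c` collects links from every other node and ends up with the highest score, while `d`, which nobody links to, keeps only the teleportation share `(1 - alpha) / n`.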

## Step 3: Approximate Pagerank over $G$ using Monte-Carlo

Compute the Pagerank over $G$ using a simple Monte-Carlo implementation, then compute its L1 error against the vector from Step 2.
This step is essential because it returns the csr-matrix `walk_visited_count`, which the Subrank algorithm uses later.
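
For reference, the L1 error computed in this step can be written as follows. The hat notation $\hat{p}_G$ for the Monte-Carlo estimate is introduced here for illustration only; in the code it is the `mc_pagerank` dictionary, and $p_G$ is the vector `p_G` from Step 2.

```latex
\[
\mathrm{error}_G \;=\; \lVert p_G - \hat{p}_G \rVert_1
\;=\; \sum_{v \in G} \bigl| p_G(v) - \hat{p}_G(v) \bigr|
\]
```

The same formula, restricted to the nodes of $S$, is used for the errors in the later steps.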

In [11]:

# number of the random walks per node
R = 10

# fix the order of the nodes
nodelist = list(G.nodes())

tic = time.time()

# perform the random walks and get the monte-carlo pagerank
walk_visited_count, mc_pagerank = get_mc_pagerank(G, R, nodelist)

await print_results(mc_pagerank, index_map, show_results_num, getmetadata=fetch_metadata)

toc = time.time()
print(f'performed random walks in {toc-tic} seconds')

# computing the L1 error
error_G_mc = sum( abs(p_G[node] - mc_pagerank[node])
                  for node in G.nodes() )

print(f'error pagerank vs mc pagerank in G = {error_G_mc}')

progress = 100%       
Total walks performed:  440140
Don't ₿elieve the Hype 🦊(npub1nxa4tywfz9nqp7z9zp7nr7d4nchhclsf58lcqt5y782rmf2hefjquaa6q8) 8.164129902339354e-05
The: Daniel⚡️(npub1aeh2zw4elewy5682lxc6xnlqzjnxksq303gwu2npfaxd49vmde6qcq4nwx) 4.854347509499075e-05
zach(npub10fu0hlkx3s4n4dsgfu0cpqephga4afr4qtzpz9vsyqf7vj88v2yqdp8vp4) 4.413043190453705e-05
elsat(npub1zafcms4xya5ap9zr7xxr0jlrtrattwlesytn2s42030lzu0dwlzqpd26k5) 4.63369534997639e-05
opreturnbot(npub1e30jt8crv6phnrj22gr3mwuhywrs9lak7ry94akjw0ydm0juptas5xmkwq) 3.0891302333175936e-05
Seth For Privacy(npub1tr4dstaptd2sp98h7hlysp8qle6mw7wmauhfkgz3rmxdd8ndprusnw2y5g) 4.63369534997639e-05
JeffG(npub1zuuajd7u3sx8xu92yav9jwxpr839cs0kc3q6t56vd5u9q033xmhsk6c2uc) 7.722825583293983e-05
CARLA(npub1hu3hdctm5nkzd8gslnyedfr5ddz3z547jqcl5j88g4fame2jd08qh6h8nh) 7.060869104725928e-05
Derek Ross(npub18ams6ewn5aj2n3wt2qawzglx9mr4nzksxhvrdc4gzrecw7n5tvjqctp424) 8.605434221384724e-05
DickWhitman(npub102a0auqvye3eayugvfwy44un9l477t45uck
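
To make the idea of the Monte-Carlo step concrete, here is a minimal pure-Python sketch of a random-walk Pagerank estimator. It is not the code of `get_mc_pagerank`: the function name, the toy graph, and the per-start-node layout of the visit counts (a dict of dicts standing in for a sparse matrix) are assumptions made for illustration. Each node starts `walks_per_node` walks; a walk continues with probability `alpha` and stops otherwise, and the share of visits each node receives is the estimate.

```python
import random


def mc_pagerank(out_edges, walks_per_node, alpha=0.85, seed=0):
    """Estimate Pagerank from random walks on {node: [successors]}."""
    rng = random.Random(seed)
    # assumed layout: start node -> {visited node: number of visits}
    visits_from = {v: {} for v in out_edges}
    total = {v: 0 for v in out_edges}
    for start in out_edges:
        row = visits_from[start]
        for _ in range(walks_per_node):
            node = start
            while True:
                row[node] = row.get(node, 0) + 1
                total[node] += 1
                # stop with probability 1 - alpha, or when there is nowhere to go
                if not out_edges[node] or rng.random() > alpha:
                    break
                node = rng.choice(out_edges[node])
    n_visits = sum(total.values())
    estimate = {v: total[v] / n_visits for v in out_edges}
    return visits_from, estimate


# toy graph without dangling nodes
toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
visits_from, estimate = mc_pagerank(toy, walks_per_node=5000)

# exact Pagerank of the toy graph for alpha = 0.85, solved by hand
exact = {"a": 0.3725269, "b": 0.1958239, "c": 0.3941493, "d": 0.0375}
l1_error = sum(abs(exact[v] - estimate[v]) for v in toy)
```

Keeping the visit counts grouped by starting node, rather than only the totals, is what makes it possible to reuse the walks later without repeating them.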

## Step 4: Select random subgraph $S$ and compute its Pagerank distribution

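Select a random subgraph $S$ consisting of 500 nodes and compute its Pagerank distribution. The sample size cannot exceed the number of nodes in $G$ (44014 in the run above), so a 50k sample would need a larger graph.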

In [12]:
# selecting random subgraph S
S_nodes = set(random.sample(list(G.nodes()), k=500))  # k must not exceed len(G); e.g. 50000 on a larger graph
S = G.subgraph(S_nodes).copy()

# computing pagerank over S
print('computing local pagerank...')
tic = time.time()

p_S = nx.pagerank(S, tol=1e-12)
await print_results(p_S, index_map, show_results_num, getmetadata=fetch_metadata)


toc = time.time()
print(f'finished in {toc-tic} seconds')

computing local pagerank...
Robert(npub1zv62e6wxx4lnsnfuwek9xpxlt3ahx6xda7e3zh5w5dkzz5md9lps6ggzf0) 0.0019634124467849253
aragol7(npub1ptwt040pjt3pd2lx9x0reysshwu5l7t7gtclza94aty3f008y4csut9jy7) 0.0019634124467849253
Alex Jones(npub1fg38s8xuhn4petadndvekvspkz7vpmdundq4vza5fc4v9el4cd9qwghuct) 0.0019634124467849253
rohitkumarjain(npub17jp3xlr5quxul9nxh2muhqk5qm76thq974cx4wfvvztav9fejkrqc0w0tj) 0.0019634124467849253
Feynman(npub19xt4d6epa8xtse8x6wh0fqz0hc5kzu7cwr0677t2kshlrjzs2nzserr5fk) 0.0019634124467849253
saunter(npub1l3gfsderx4ktqhcmwzgegwatkv9v6fs0hujvlwznje0c90xm7m6qs2s6a5) 0.0019634124467849253
ACME(npub1pu5x5dmkryc7sp20399lvm6sh9rnp9gydwuc9jug6r88kcq6t85qalqymy) 0.0022018268153709605
Ali (npub13es8zhzmvmhfa0ekxm74ah94nhall24ke2005kdlkkcwwxlm5qaqpdxfxk) 0.0025831954811405765
(npub1egw0ecrcyxytmsl7kx2hjmrp2pua354dt2k23mjc8z4g4pwkqqvs68cr06) 0.0019634124467849253
nekio(npub1hzdf5vjg0hz7yxjvzrtvatv0wcjg52gd6a3ryerv5w79rfj5kzws3yf3mm) 0.0019634124467849253
rafbe(npub1f4z7l8x59ftwp76zn
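
`G.subgraph(S_nodes)` returns the induced subgraph: it keeps the sampled nodes and only the edges whose two endpoints were both sampled. The following pure-Python sketch shows the same operation on a toy graph; the helper name `induced_subgraph` and the toy data are made up for illustration.

```python
import random


def induced_subgraph(out_edges, keep):
    """Keep only the nodes in `keep` and the edges between them."""
    keep = set(keep)
    return {
        v: [w for w in out_edges[v] if w in keep]
        for v in out_edges
        if v in keep
    }


toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

# fixed selection: dropping "c" removes every edge that pointed to it
fixed = induced_subgraph(toy, ["a", "b", "d"])

# random selection, seeded so the run is repeatable
rng = random.Random(42)
sample = rng.sample(sorted(toy), k=3)
sub = induced_subgraph(toy, sample)
```

In `fixed` only the edge `a -> b` survives, so `b` and `d` become isolated or dangling. With a small random sample of a large sparse graph, many sampled nodes can end up with no edges inside $S$, which would be consistent with many nodes sharing the same score in the output above.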

## Step 4b: Use integrated functions

As a baseline, compute the Pagerank over $G$ with the networkx built-in function at its default tolerance, then compute its L1 distance from the Pagerank of $S$ on the nodes of $S$.

In [14]:
# computing the built-in pagerank over G
print('computing integrated pagerank function')
tic = time.time()

pr = nx.pagerank(G)
await print_results(pr, index_map, show_results_num, getmetadata=fetch_metadata)

toc = time.time()
print(f'finished in {toc-tic} seconds')

# computing the L1 error on the nodes of S
error_S_global = sum( abs(p_S[node] - pr[node])
                      for node in S_nodes )

print(f'error pagerank of S vs pagerank of G in S = {error_S_global}')

computing integrated pagerank function
Don't ₿elieve the Hype 🦊(npub1nxa4tywfz9nqp7z9zp7nr7d4nchhclsf58lcqt5y782rmf2hefjquaa6q8) 7.33073810886042e-05
The: Daniel⚡️(npub1aeh2zw4elewy5682lxc6xnlqzjnxksq303gwu2npfaxd49vmde6qcq4nwx) 4.82284734499105e-05
zach(npub10fu0hlkx3s4n4dsgfu0cpqephga4afr4qtzpz9vsyqf7vj88v2yqdp8vp4) 3.6096112371534015e-05
elsat(npub1zafcms4xya5ap9zr7xxr0jlrtrattwlesytn2s42030lzu0dwlzqpd26k5) 3.932073936226785e-05
opreturnbot(npub1e30jt8crv6phnrj22gr3mwuhywrs9lak7ry94akjw0ydm0juptas5xmkwq) 2.6532755448179916e-05
Seth For Privacy(npub1tr4dstaptd2sp98h7hlysp8qle6mw7wmauhfkgz3rmxdd8ndprusnw2y5g) 3.7662079240686915e-05
JeffG(npub1zuuajd7u3sx8xu92yav9jwxpr839cs0kc3q6t56vd5u9q033xmhsk6c2uc) 5.037600211657074e-05
CARLA(npub1hu3hdctm5nkzd8gslnyedfr5ddz3z547jqcl5j88g4fame2jd08qh6h8nh) 6.244046482882089e-05
Derek Ross(npub18ams6ewn5aj2n3wt2qawzglx9mr4nzksxhvrdc4gzrecw7n5tvjqctp424) 7.071551628113585e-05
DickWhitman(npub102a0auqvye3eayugvfwy44un9l477t45uck8s2p08xzpgh78
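
One thing to keep in mind when reading an error computed this way: `nx.pagerank(G)` sums to 1 over all of $G$, while `p_S` sums to 1 over $S$ only, so the two vectors live on different scales. The sketch below, with made-up numbers and hypothetical helper names, shows how restricting a global vector to $S$ and renormalizing it changes the L1 distance.

```python
def restrict_and_renormalize(scores, subset):
    """Restrict a score vector to `subset` and rescale it to sum to 1."""
    mass = sum(scores[v] for v in subset)  # assumed to be > 0
    return {v: scores[v] / mass for v in subset}


def l1_distance(p, q):
    """Sum of absolute differences over the keys of p."""
    return sum(abs(p[v] - q[v]) for v in p)


# made-up vectors: `global_scores` sums to 1 over four nodes,
# `local_scores` sums to 1 over the two nodes of the subset
global_scores = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
local_scores = {"a": 0.6, "b": 0.4}

raw_error = l1_distance(local_scores, global_scores)
renormalized = restrict_and_renormalize(global_scores, local_scores)
renormalized_error = l1_distance(local_scores, renormalized)
```

Here the raw distance is 0.3, while after renormalization it drops to about 0.057. Even after renormalization the two vectors generally differ, because the walks on $G$ also pass through nodes outside $S$.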

## Step 5: Approximate Pagerank over $S$ using Subrank

Run the Subrank algorithm to approximate the Pagerank over the subgraph $S$ of $G$, then compute the L1 error.

In [15]:
# computing subrank
print('computing subrank over S...')
tic = time.time()

subrank = get_subrank(S, G, walk_visited_count, nodelist)
await print_results(subrank, index_map, show_results_num, getmetadata=fetch_metadata)

toc = time.time()
print(f'performed random walks in {toc-tic} seconds')

# computing the L1 error
error_S_subrank = sum( abs(p_S[node] - subrank[node])
                      for node in S_nodes )

print(f'error pagerank vs subrank in S = {error_S_subrank}')

computing subrank over S...
walks performed = 75
Robert(npub1zv62e6wxx4lnsnfuwek9xpxlt3ahx6xda7e3zh5w5dkzz5md9lps6ggzf0) 0.0019623233908948193
aragol7(npub1ptwt040pjt3pd2lx9x0reysshwu5l7t7gtclza94aty3f008y4csut9jy7) 0.0019623233908948193
Alex Jones(npub1fg38s8xuhn4petadndvekvspkz7vpmdundq4vza5fc4v9el4cd9qwghuct) 0.0019623233908948193
rohitkumarjain(npub17jp3xlr5quxul9nxh2muhqk5qm76thq974cx4wfvvztav9fejkrqc0w0tj) 0.0019623233908948193
Feynman(npub19xt4d6epa8xtse8x6wh0fqz0hc5kzu7cwr0677t2kshlrjzs2nzserr5fk) 0.0019623233908948193
saunter(npub1l3gfsderx4ktqhcmwzgegwatkv9v6fs0hujvlwznje0c90xm7m6qs2s6a5) 0.0019623233908948193
ACME(npub1pu5x5dmkryc7sp20399lvm6sh9rnp9gydwuc9jug6r88kcq6t85qalqymy) 0.0021585557299843012
Ali (npub13es8zhzmvmhfa0ekxm74ah94nhall24ke2005kdlkkcwwxlm5qaqpdxfxk) 0.002551020408163265
(npub1egw0ecrcyxytmsl7kx2hjmrp2pua354dt2k23mjc8z4g4pwkqqvs68cr06) 0.0019623233908948193
nekio(npub1hzdf5vjg0hz7yxjvzrtvatv0wcjg52gd6a3ryerv5w79rfj5kzws3yf3mm) 0.0019623233908948193
rafbe(np

## Step 6: Approximate Pagerank over $S$ using Monte-Carlo naive recomputation

Run the Monte-Carlo Pagerank algorithm on $S$ as a reference for the number of random walks required and the error achieved.

In [16]:

# computing the monte-carlo pagerank 
print('computing naive monte-carlo pagerank over S')
tic = time.time()

_, mc_pagerank_S_naive = get_mc_pagerank(S,R)

await print_results(mc_pagerank_S_naive, index_map, show_results_num, getmetadata=fetch_metadata)


toc = time.time()
print(f'finished in {toc-tic} seconds')

# computing the L1 error
error_S_naive = sum( abs(p_S[node] - mc_pagerank_S_naive[node])
                      for node in S.nodes())

print(f'error pagerank vs mc pagerank in S = {error_S_naive}')

computing naive monte-carlo pagerank over S
progress = 100%       
Total walks performed:  5000
Robert(npub1zv62e6wxx4lnsnfuwek9xpxlt3ahx6xda7e3zh5w5dkzz5md9lps6ggzf0) 0.0019654088050314465
aragol7(npub1ptwt040pjt3pd2lx9x0reysshwu5l7t7gtclza94aty3f008y4csut9jy7) 0.0019654088050314465
Alex Jones(npub1fg38s8xuhn4petadndvekvspkz7vpmdundq4vza5fc4v9el4cd9qwghuct) 0.0019654088050314465
rohitkumarjain(npub17jp3xlr5quxul9nxh2muhqk5qm76thq974cx4wfvvztav9fejkrqc0w0tj) 0.0019654088050314465
Feynman(npub19xt4d6epa8xtse8x6wh0fqz0hc5kzu7cwr0677t2kshlrjzs2nzserr5fk) 0.0019654088050314465
saunter(npub1l3gfsderx4ktqhcmwzgegwatkv9v6fs0hujvlwznje0c90xm7m6qs2s6a5) 0.0019654088050314465
ACME(npub1pu5x5dmkryc7sp20399lvm6sh9rnp9gydwuc9jug6r88kcq6t85qalqymy) 0.0019654088050314465
Ali (npub13es8zhzmvmhfa0ekxm74ah94nhall24ke2005kdlkkcwwxlm5qaqpdxfxk) 0.0033411949685534592
(npub1egw0ecrcyxytmsl7kx2hjmrp2pua354dt2k23mjc8z4g4pwkqqvs68cr06) 0.0019654088050314465
nekio(npub1hzdf5vjg0hz7yxjvzrtvatv0wcjg52gd6a3ryerv5
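
Putting the walk counts printed in this run side by side gives a rough idea of the saving. The numbers below are copied from the outputs above (a single run with `R = 10` and a 500-node sample), so they illustrate this run only and are not a general benchmark.

```python
R = 10                    # walks per node, as set in Step 3
subgraph_size = 500       # nodes sampled in Step 4
walks_on_G = 440140       # "Total walks performed" in Step 3
subrank_walks = 75        # "walks performed" printed by Subrank
naive_walks = R * subgraph_size   # matches "Total walks performed:  5000"

ratio = naive_walks / subrank_walks
print(f'naive recomputation: {naive_walks} walks')
print(f'subrank: {subrank_walks} new walks')
print(f'ratio: {ratio:.1f}x')
```

In this run the naive recomputation needed roughly 67 times as many new walks as Subrank, which in turn relies on the walks already performed over $G$ in Step 3.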