Wikipedia clickstream

The Wikipedia Clickstream dataset contains counts of (referer, resource) pairs extracted from the request logs of Wikipedia. A referer is an HTTP header field that identifies the address of the webpage that linked to the resource being requested. The data shows how people get to a Wikipedia article and what links they click on. Documentation here.

This exploration uses the English Wikipedia .tsv for March 2020, which is available here. Dumps for other language editions are also available.

Other Analytics Resources for Wikimedia

In [1]:
import pandas as pd
import re
In [2]:
df = pd.read_csv('../clickstream-enwiki-2020-03.tsv', delimiter='\t', header=None, names=['prev', 'curr', 'type', 'n'], usecols=[0, 1, 2, 3])
df.head()
Out[2]:
prev curr type n
0 other-empty Kamasḥalta external 11
1 other-search Melanie_Windridge external 64
2 other-empty Melanie_Windridge external 13
3 other-empty Malaysia–Namibia_relations external 15
4 other-search Ding_Chao external 19
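Article titles may contain quote characters or strings such as "NA", which the default CSV parsing could reinterpret. The following is a minimal sketch of a stricter loader, not the one used for the outputs in this notebook. The function name `load_clickstream` and the in-memory sample rows are hypothetical; only the four column names come from the cell above.

```python
import csv
import io

import pandas as pd

# Tiny in-memory stand-in for the real .tsv (hypothetical rows, same four columns)
sample_tsv = (
    'other-search\t"Heroes"_(David_Bowie_album)\texternal\t25\n'
    'other-empty\tMain_Page\texternal\t100\n'
    'Main_Page\tDeaths_in_2020\tlink\t40\n'
)


def load_clickstream(source):
    """Read a clickstream TSV without quote handling or NaN conversion."""
    return pd.read_csv(
        source,
        sep='\t',
        header=None,
        names=['prev', 'curr', 'type', 'n'],
        dtype={'prev': str, 'curr': str, 'type': 'category', 'n': 'int64'},
        quoting=csv.QUOTE_NONE,   # keep quote characters as part of the title
        keep_default_na=False,    # keep titles such as "NA" as plain strings
    )


sample = load_clickstream(io.StringIO(sample_tsv))
```

Passing the path of the downloaded file instead of the `StringIO` object should work the same way, assuming the file has no header row.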

Top Articles

In [3]:
# Sum only the count column; prev and type are strings
df.groupby('curr')[['n']].sum().sort_values('n', ascending=False)[:5]
Out[3]:
n
curr
Main_Page 300944959
United_States_Senate 132753424
Hyphen-minus 55283299
2019–20_coronavirus_pandemic 30709138
Coronavirus 11074539

Top Referers

This post by Ellery Wulczyn (Data Scientist @ WMF) states that other-empty (refererless traffic) usually comes from clients using HTTPS.

In [4]:
# Sum only the count column; curr and type are strings
df.groupby('prev')[['n']].sum().sort_values('n', ascending=False)[:5]
Out[4]:
n
prev
other-search 3204375198
other-empty 1980695162
other-internal 145878947
other-external 90525286
Main_Page 36256389
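The referers starting with `other-` are aggregate buckets rather than article titles, so it can help to see what share of all requests they account for. A minimal sketch on a hypothetical miniature frame follows; the counts are made up for illustration and the variable names are not from the notebook.

```python
import pandas as pd

# Hypothetical miniature clickstream; counts are made up for illustration
sample = pd.DataFrame(
    {
        'prev': ['other-search', 'other-empty', 'other-search', 'Main_Page', 'Coronavirus'],
        'curr': ['Coronavirus', 'Coronavirus', 'Pandemic', 'Coronavirus', 'Pandemic'],
        'type': ['external', 'external', 'external', 'link', 'link'],
        'n': [50, 30, 10, 6, 4],
    }
)

# Total requests per referer
by_prev = sample.groupby('prev')['n'].sum()

# Referers named "other-*" are buckets, not article titles
is_bucket = by_prev.index.str.startswith('other-')

# Share of all requests that arrived via one of the buckets
bucket_share = by_prev[is_bucket].sum() / by_prev.sum()

# Share of each referer, largest first
referer_share = (by_prev / by_prev.sum()).sort_values(ascending=False)
```

Replacing `sample` with the full `df` should give the same breakdown for the real data.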

Outgoing requests from main page

In [5]:
outgoingWPMain = df.loc[(df['prev'] == 'Main_Page')]
outgoingWPMain.sort_values('n', ascending=False)[:5]
Out[5]:
prev curr type n
18436309 Main_Page 2019–20_coronavirus_pandemic link 3132537
18987983 Main_Page Hyphen-minus other 3034711
25130927 Main_Page Deaths_in_2020 link 1407219
16102638 Main_Page 2019–20_coronavirus_pandemic_by_country_and_te... link 210114
5421934 Main_Page Coronavirus_disease_2019 link 177632

Coronavirus data exploration

In [6]:
coronaDf = df.loc[(df['prev'] == '2019–20_coronavirus_pandemic') | (df['curr'] == '2019–20_coronavirus_pandemic')]
coronaDf.sort_values('n', ascending=False)
Out[6]:
prev curr type n
18437380 other-search 2019–20_coronavirus_pandemic external 10653762
18434022 other-empty 2019–20_coronavirus_pandemic external 9985625
18436309 Main_Page 2019–20_coronavirus_pandemic link 3132537
27228795 2019–20_coronavirus_pandemic 2020_coronavirus_pandemic_in_the_United_States link 2006708
18435884 other-internal 2019–20_coronavirus_pandemic external 1704299
... ... ... ... ...
18438802 We_Are_Number_One 2019–20_coronavirus_pandemic other 10
18435317 Benito_Mussolini 2019–20_coronavirus_pandemic other 10
3514911 2019–20_coronavirus_pandemic Nick_Foles other 10
18437362 Les_Prophéties 2019–20_coronavirus_pandemic other 10
14520135 2019–20_coronavirus_pandemic Foodborne_illness other 10

9907 rows × 4 columns

In [7]:
# Rename on a copy so that coronaDf keeps its original column names
exportCov = coronaDf.rename(columns={'prev': 'source', 'curr': 'target', 'n': 'value'})
exportCov = exportCov.sort_values('value', ascending=False)
# Flag targets that also occur as a source (excluding the pandemic article itself)
targetsAlsoSources = exportCov['target'].isin(exportCov['source']) & (exportCov['target'] != '2019–20_coronavirus_pandemic')
# Append " *" to the flagged targets
exportCov.loc[targetsAlsoSources, 'target'] += " *"
exportCov.to_csv("InOut_2019–20_coronavirus_pandemic.tsv", index=False, sep="\t")
exportCov[:100].to_csv("InOutTop100_2019–20_coronavirus_pandemic.tsv", index=False, sep="\t")
exportCov
Out[7]:
source target type value
18437380 other-search 2019–20_coronavirus_pandemic external 10653762
18434022 other-empty 2019–20_coronavirus_pandemic external 9985625
18436309 Main_Page 2019–20_coronavirus_pandemic link 3132537
27228795 2019–20_coronavirus_pandemic 2020_coronavirus_pandemic_in_the_United_States * link 2006708
18435884 other-internal 2019–20_coronavirus_pandemic external 1704299
... ... ... ... ...
18438802 We_Are_Number_One 2019–20_coronavirus_pandemic other 10
18435317 Benito_Mussolini 2019–20_coronavirus_pandemic other 10
3514911 2019–20_coronavirus_pandemic Nick_Foles other 10
18437362 Les_Prophéties 2019–20_coronavirus_pandemic other 10
14520135 2019–20_coronavirus_pandemic Foodborne_illness other 10

9907 rows × 4 columns

Incoming requests to main pandemic article

In [8]:
coronaDf.columns = ["prev", "curr", "type", "n"]
incomingMain = coronaDf.loc[(coronaDf['curr'] == '2019–20_coronavirus_pandemic')]
incomingMain.groupby('prev')[['n']].sum().sort_values('n', ascending=False)[:10]
Out[8]:
n
prev
other-search 10653762
other-empty 9985625
Main_Page 3132537
other-internal 1704299
Coronavirus_disease_2019 1203042
Coronavirus 759147
Severe_acute_respiratory_syndrome_coronavirus_2 263889
other-external 195789
Pandemic 132594
2020_coronavirus_pandemic_in_the_United_States 125656

Outgoing requests from main pandemic article

In [9]:
outgoingMain = df.loc[(df['prev'] == '2019–20_coronavirus_pandemic')]
outgoingMain.groupby('curr')[['n']].sum().sort_values('n', ascending=False)[:10]
Out[9]:
n
curr
2020_coronavirus_pandemic_in_the_United_States 2006708
2020_coronavirus_pandemic_in_Italy 1297838
2019–20_coronavirus_pandemic_by_country_and_territory 679592
2020_coronavirus_pandemic_in_Spain 624386
2020_coronavirus_pandemic_in_Germany 550413
2020_coronavirus_pandemic_in_the_United_Kingdom 469164
2019–20_coronavirus_pandemic_in_mainland_China 453397
2020_coronavirus_pandemic_in_India 376995
2020_coronavirus_pandemic_in_France 341264
Coronavirus_disease_2019 308027
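The incoming and outgoing queries above follow the same pattern with `prev` and `curr` swapped. A minimal sketch of a reusable helper is shown below; `top_neighbors` and the sample rows are hypothetical names and data introduced only for illustration.

```python
import pandas as pd


def top_neighbors(clicks, article, direction='in', k=10):
    """Return the k largest referers ('in') or destinations ('out') of an article."""
    if direction == 'in':
        match_col, group_col = 'curr', 'prev'
    elif direction == 'out':
        match_col, group_col = 'prev', 'curr'
    else:
        raise ValueError("direction must be 'in' or 'out'")
    # Keep rows where the article sits on the matching side, then total the other side
    subset = clicks.loc[clicks[match_col] == article]
    return subset.groupby(group_col)['n'].sum().nlargest(k)


# Hypothetical rows with single-letter article names
sample = pd.DataFrame(
    {
        'prev': ['other-search', 'A', 'B', 'B', 'B'],
        'curr': ['B', 'B', 'C', 'D', 'A'],
        'type': ['external', 'link', 'link', 'other', 'link'],
        'n': [100, 20, 30, 5, 7],
    }
)

top_in = top_neighbors(sample, 'B', direction='in', k=2)
top_out = top_neighbors(sample, 'B', direction='out', k=2)
```

Called as `top_neighbors(df, '2019–20_coronavirus_pandemic', 'out')`, it should reproduce the ranking above as a Series instead of a DataFrame.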
In [10]:
# Total only the count column; summing the string columns just concatenates them
incomingMain[['n']].sum()
Out[10]:
n    30709138
dtype: int64
In [11]:
outgoingMain.sort_values('n', ascending=False)
Out[11]:
prev curr type n
27228795 2019–20_coronavirus_pandemic 2020_coronavirus_pandemic_in_the_United_States link 2006708
29488393 2019–20_coronavirus_pandemic 2020_coronavirus_pandemic_in_Italy link 1297838
16102610 2019–20_coronavirus_pandemic 2019–20_coronavirus_pandemic_by_country_and_te... link 679592
10947455 2019–20_coronavirus_pandemic 2020_coronavirus_pandemic_in_Spain link 624386
12622707 2019–20_coronavirus_pandemic 2020_coronavirus_pandemic_in_Germany link 550413
... ... ... ... ...
29289909 2019–20_coronavirus_pandemic German_Empire other 10
29240726 2019–20_coronavirus_pandemic Roman_Reigns other 10
21142317 2019–20_coronavirus_pandemic Ahsoka_Tano other 10
29201601 2019–20_coronavirus_pandemic United_Airlines other 10
21856147 2019–20_coronavirus_pandemic Jay-Z other 10

4031 rows × 4 columns

Outgoing requests from main pandemic article via links in that article

In [12]:
outgoingMainLinks = outgoingMain.loc[(outgoingMain['type'] == 'link')]
outgoingMainLinks.sort_values('n', ascending=False)
Out[12]:
prev curr type n
27228795 2019–20_coronavirus_pandemic 2020_coronavirus_pandemic_in_the_United_States link 2006708
29488393 2019–20_coronavirus_pandemic 2020_coronavirus_pandemic_in_Italy link 1297838
16102610 2019–20_coronavirus_pandemic 2019–20_coronavirus_pandemic_by_country_and_te... link 679592
10947455 2019–20_coronavirus_pandemic 2020_coronavirus_pandemic_in_Spain link 624386
12622707 2019–20_coronavirus_pandemic 2020_coronavirus_pandemic_in_Germany link 550413
... ... ... ... ...
16268096 2019–20_coronavirus_pandemic Handle_System link 10
1817503 2019–20_coronavirus_pandemic MS_Zaandam link 10
28775653 2019–20_coronavirus_pandemic Indian_local_government_response_to_the_2020_c... link 10
1205451 2019–20_coronavirus_pandemic Our_World_in_Data link 10
9330442 2019–20_coronavirus_pandemic 2020_coronavirus_pandemic_in_South_Asia link 10

1308 rows × 4 columns

Searches while on main pandemic article

Type link means the requested page is linked from the main pandemic article itself. Type other means no such link exists; the request could come from a search, but it could also carry an incorrect referer.

In [13]:
outgoingMain.groupby('type')[['n']].sum()
Out[13]:
n
type
link 13879231
other 454349
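To put the two totals in proportion, the counts from Out[13] can be turned into shares. This is a small sketch that copies the two numbers by hand rather than recomputing them from `outgoingMain`; it suggests that roughly 3% of outgoing requests are of type other.

```python
import pandas as pd

# Totals copied from Out[13]
by_type = pd.Series({'link': 13879231, 'other': 454349}, name='n')

# Fraction of outgoing requests per type
type_share = by_type / by_type.sum()
```

With the full data loaded, the same result should be available from `outgoingMain.groupby('type')['n'].sum()` divided by its own sum.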
In [14]:
coronaMainSearch = coronaDf.loc[coronaDf['type'] == 'other']
coronaMainSearch.sort_values('n', ascending=False)
Out[14]:
prev curr type n
20893419 2019–20_coronavirus_pandemic Main_Page other 97652
18998157 2019–20_coronavirus_pandemic Hyphen-minus other 33262
1903860 2019–20_coronavirus_pandemic Horseshoe_bat other 32520
29735986 2019–20_coronavirus_pandemic Hemoptysis other 18386
11672604 2019–20_coronavirus_pandemic Worldometer other 16241
... ... ... ... ...
18439302 Coyote 2019–20_coronavirus_pandemic other 10
18439298 Roberto_Benigni 2019–20_coronavirus_pandemic other 10
18436379 State_of_Palestine 2019–20_coronavirus_pandemic other 10
18439295 Emilio_Salgari 2019–20_coronavirus_pandemic other 10
18435143 Big_Little_Lies_(TV_series) 2019–20_coronavirus_pandemic other 10

6152 rows × 4 columns
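The table above mixes both directions, because coronaDf holds rows where the pandemic article is either prev or curr. A minimal sketch, assuming only requests that start on the article are of interest, also filters on prev. The sample rows and counts below are hypothetical.

```python
import pandas as pd

ARTICLE = '2019–20_coronavirus_pandemic'

# Hypothetical rows; counts are made up for illustration
sample = pd.DataFrame(
    {
        'prev': [ARTICLE, ARTICLE, 'Coyote', ARTICLE],
        'curr': ['Horseshoe_bat', 'Hemoptysis', ARTICLE, '2020_coronavirus_pandemic_in_Italy'],
        'type': ['other', 'other', 'other', 'link'],
        'n': [32, 18, 10, 99],
    }
)

# Keep only type "other" requests whose referer is the pandemic article
outgoing_other = sample.loc[
    (sample['prev'] == ARTICLE) & (sample['type'] == 'other')
].sort_values('n', ascending=False)
```

On the real data, the equivalent filter would be `outgoingMain.loc[outgoingMain['type'] == 'other']`.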