Wikipedia clickstream¶

The Wikipedia Clickstream dataset contains counts of (referer, resource) pairs extracted from the request logs of Wikipedia. A referer is an HTTP header field that identifies the address of the webpage that linked to the resource being requested. The data shows how people get to a Wikipedia article and what links they click on. Documentation here.
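
To make the (referer, resource) pair format concrete, here is a minimal sketch that parses a few rows with the standard library and totals the incoming requests per article. The sample rows, article names, and counts are invented for illustration. The four tab-separated fields follow the column names (prev, curr, type, n) used when the file is loaded in this notebook.

```python
import csv
import io
from collections import Counter

# Made-up sample rows in the tab-separated layout: prev, curr, type, n
sample = (
    "other-search\tExample_article\texternal\t120\n"
    "Main_Page\tExample_article\tlink\t30\n"
    "Example_article\tAnother_article\tlink\t12\n"
)

incoming = Counter()
# QUOTE_NONE treats quote characters inside titles as ordinary text
reader = csv.reader(io.StringIO(sample), delimiter="\t", quoting=csv.QUOTE_NONE)
for prev, curr, link_type, n in reader:
    incoming[curr] += int(n)

print(incoming.most_common())
```

Each row is one (referer, resource) pair, so summing n per curr value gives the total number of recorded arrivals at that article.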

This exploration takes the .tsv for English Wikipedia from March 2020 that is available here. There are other languages available.

Other Analytics Resources for Wikimedia

import pandas as pd

# Columns: referer (prev), requested article (curr), referer type, and pair count (n)
df = pd.read_csv(
    '../clickstream-enwiki-2020-03.tsv',
    delimiter='\t',
    header=None,
    names=['prev', 'curr', 'type', 'n'],
    usecols=[0, 1, 2, 3],
)
df.head()

Top Referers¶

This post by Ellery Wulczyn (Data Scientist @ WMF) states that other-empty (refererless traffic) usually comes from clients using HTTPS.

# Sum only the numeric count column; summing the string columns would concatenate them
df.groupby('prev')[['n']].sum().sort_values('n', ascending=False)[:5]
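
As a rough illustration of how much traffic arrives without an article referer, the sketch below splits referers into the aggregated `other-*` buckets and ordinary article titles, then computes each group's share of requests. The rows are invented toy data, and the assumption that every aggregated bucket starts with the prefix `other-` is based only on the referer names that appear in this notebook's output.

```python
import pandas as pd

# Made-up rows; real counts come from the clickstream file
toy = pd.DataFrame({
    "prev": ["other-search", "other-empty", "Main_Page", "Some_article"],
    "curr": ["A", "A", "A", "B"],
    "type": ["external", "external", "link", "link"],
    "n": [60, 25, 10, 5],
})

# Assumption: aggregated (non-article) referers all start with "other-"
labels = toy["prev"].str.startswith("other-").map({True: "other-*", False: "article"})

# Share of all requests coming from each referer group
share = toy.groupby(labels)["n"].sum() / toy["n"].sum()
print(share)
```

The same grouping could be applied to `df` to see what fraction of requests come from search engines, empty referers, and other non-article sources.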

Outgoing requests from main page¶

outgoingWPMain = df.loc[(df['prev'] == 'Main_Page')]
outgoingWPMain.sort_values('n', ascending=False)[:5]

Coronavirus data exploration¶

coronaDf = df.loc[(df['prev'] == '2019–20_coronavirus_pandemic') | (df['curr'] == '2019–20_coronavirus_pandemic')]
coronaDf.sort_values('n', ascending=False)

# Work on a renamed copy so coronaDf keeps its prev/curr/type/n columns
exportCov = coronaDf.rename(columns={'prev': 'source', 'curr': 'target', 'n': 'value'})
exportCov = exportCov.sort_values('value', ascending=False)
# Targets (other than the pandemic article) that also appear as a source get a " *" suffix
sources = set(exportCov['source'])
targetsAlsoSources = exportCov['target'].isin(sources) & (exportCov['target'] != '2019–20_coronavirus_pandemic')
exportCov.loc[targetsAlsoSources, 'target'] += " *"
exportCov.to_csv("InOut_2019–20_coronavirus_pandemic.tsv", index=False, sep="\t")
exportCov[:100].to_csv("InOutTop100_2019–20_coronavirus_pandemic.tsv", index=False, sep="\t")
exportCov

Incoming requests to main pandemic article¶

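# Make sure coronaDf uses the original column names (the export step may have renamed them)
coronaDf.columns = ["prev", "curr", "type", "n"]
incomingMain = coronaDf.loc[(coronaDf['curr'] == '2019–20_coronavirus_pandemic')]
# Sum only the numeric count column
incomingMain.groupby('prev')[['n']].sum().sort_values('n', ascending=False)[:10]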

Outgoing requests from main pandemic article¶

outgoingMain = df.loc[(df['prev'] == '2019–20_coronavirus_pandemic')]
outgoingMain.groupby('curr')[['n']].sum().sort_values('n', ascending=False)[:10]

# Total incoming requests; sum only n, since summing string columns concatenates them
incomingMain['n'].sum()

30709138

outgoingMain.sort_values('n', ascending=False)

Outgoing requests from main pandemic article that follow a link in that article¶

outgoingMainLinks = outgoingMain.loc[(outgoingMain['type'] == 'link')]
outgoingMainLinks.sort_values('n', ascending=False)

Searches while on main pandemic article¶

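Type link means the main pandemic article contains a link to the requested article. Type other means the referer and the request are both articles, but the referer does not link to the request; this can indicate a search from the article page, but it can also be caused by an incorrect referer.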

outgoingMain.groupby('type')[['n']].sum()
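
To show how the type column can be used to separate link clicks from other navigation, the following sketch computes the fraction of outgoing requests per type for a single page. The page names and counts are invented toy data, and `Page_X` is a hypothetical stand-in for the article of interest.

```python
import pandas as pd

# Made-up rows; Page_X stands in for the article being studied
toy = pd.DataFrame({
    "prev": ["Page_X", "Page_X", "Page_X", "other-search"],
    "curr": ["Page_Y", "Page_Z", "Main_Page", "Page_X"],
    "type": ["link", "link", "other", "external"],
    "n": [70, 20, 10, 500],
})

# Keep only requests whose referer is the page of interest
outgoing = toy[toy["prev"] == "Page_X"]

# Total requests per type, then each type's share of the outgoing traffic
by_type = outgoing.groupby("type")["n"].sum()
fraction = by_type / by_type.sum()
print(fraction)
```

In this toy data most outgoing requests follow links in the article, and the remainder falls under type other, which mixes searches with possibly incorrect referers.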

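# Note: coronaDf holds both directions, so this also includes type 'other'
# requests that arrive at the pandemic article, not only those leaving it
coronaMainSearch = coronaDf.loc[coronaDf['type'] == 'other']
coronaMainSearch.sort_values('n', ascending=False)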

curr	n
Main_Page	300944959
United_States_Senate	132753424
Hyphen-minus	55283299
2019–20_coronavirus_pandemic	30709138
Coronavirus	11074539

prev	n
other-search	3204375198
other-empty	1980695162
other-internal	145878947
other-external	90525286
Main_Page	36256389

prev	n
other-search	10653762
other-empty	9985625
Main_Page	3132537
other-internal	1704299
Coronavirus_disease_2019	1203042
Coronavirus	759147
Severe_acute_respiratory_syndrome_coronavirus_2	263889
other-external	195789
Pandemic	132594
2020_coronavirus_pandemic_in_the_United_States	125656

curr	n
2020_coronavirus_pandemic_in_the_United_States	2006708
2020_coronavirus_pandemic_in_Italy	1297838
2019–20_coronavirus_pandemic_by_country_and_territory	679592
2020_coronavirus_pandemic_in_Spain	624386
2020_coronavirus_pandemic_in_Germany	550413
2020_coronavirus_pandemic_in_the_United_Kingdom	469164
2019–20_coronavirus_pandemic_in_mainland_China	453397
2020_coronavirus_pandemic_in_India	376995
2020_coronavirus_pandemic_in_France	341264
Coronavirus_disease_2019	308027

	prev	curr	type	n
0	other-empty	Kamasḥalta	external	11
1	other-search	Melanie_Windridge	external	64
2	other-empty	Melanie_Windridge	external	13
3	other-empty	Malaysia–Namibia_relations	external	15
4	other-search	Ding_Chao	external	19

	prev	curr	type	n
18436309	Main_Page	2019–20_coronavirus_pandemic	link	3132537
18987983	Main_Page	Hyphen-minus	other	3034711
25130927	Main_Page	Deaths_in_2020	link	1407219
16102638	Main_Page	2019–20_coronavirus_pandemic_by_country_and_te...	link	210114
5421934	Main_Page	Coronavirus_disease_2019	link	177632

	prev	curr	type	n
18437380	other-search	2019–20_coronavirus_pandemic	external	10653762
18434022	other-empty	2019–20_coronavirus_pandemic	external	9985625
18436309	Main_Page	2019–20_coronavirus_pandemic	link	3132537
27228795	2019–20_coronavirus_pandemic	2020_coronavirus_pandemic_in_the_United_States	link	2006708
18435884	other-internal	2019–20_coronavirus_pandemic	external	1704299
...	...	...	...	...
18438802	We_Are_Number_One	2019–20_coronavirus_pandemic	other	10
18435317	Benito_Mussolini	2019–20_coronavirus_pandemic	other	10
3514911	2019–20_coronavirus_pandemic	Nick_Foles	other	10
18437362	Les_Prophéties	2019–20_coronavirus_pandemic	other	10
14520135	2019–20_coronavirus_pandemic	Foodborne_illness	other	10

	prev	curr	type	n
20893419	2019–20_coronavirus_pandemic	Main_Page	other	97652
18998157	2019–20_coronavirus_pandemic	Hyphen-minus	other	33262
1903860	2019–20_coronavirus_pandemic	Horseshoe_bat	other	32520
29735986	2019–20_coronavirus_pandemic	Hemoptysis	other	18386
11672604	2019–20_coronavirus_pandemic	Worldometer	other	16241
...	...	...	...	...
18439302	Coyote	2019–20_coronavirus_pandemic	other	10
18439298	Roberto_Benigni	2019–20_coronavirus_pandemic	other	10
18436379	State_of_Palestine	2019–20_coronavirus_pandemic	other	10
18439295	Emilio_Salgari	2019–20_coronavirus_pandemic	other	10
18435143	Big_Little_Lies_(TV_series)	2019–20_coronavirus_pandemic	other	10