Operator: Primefac (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)
Time filed: 14:24, Saturday, May 27, 2017 (UTC)
Automatic, Supervised, or Manual: automatic
Programming language(s): AWB
Source code available: AWB
Function overview: Remove UTM parameters (Google Analytics tracking) from external links and references (i.e. resurrect Theo's Little Bot task #23)
Links to relevant discussions (where appropriate): Wikipedia:Bot requests/Archive 55#Remove Google Analytics tracking from external links
Edit period(s): Once a month
Estimated number of pages affected: About 16,000 in the initial run, and perhaps 200 a month after that. Theo's task ran in batches of 500, which would also work, though I could not then give a timeframe.
Exclusion compliant (Yes/No): Yes
Already has a bot flag (Yes/No): Yes
Function details: Straightforward find-and-remove. Regex:
\??(?:&?utm_[^=]*?=[^&\s\]\|]*)+(?=]|\s|\|)|(?<=\?)(?:&?utm_[^=]*?=[^&\s\]\|]*)+& (test cases)
\??(?:&?utm_[^=\s]*?=[^&\s\]\|]*?)+(?=<|}|]|\s|\|)|(?<=\?)(?:&?utm_[^=\s]*?=[^&\s\]\|]*)+&|(?<=&)(?:&?utm_[^=\s]*?=[^&\s\]\|]*)+& (tests)
As near as I can tell, I've managed to cover all of the edge cases that were of concern in the original BRFA. The blue section covers the case where ?utm_ is followed by an & that is not followed by another utm_ (e.g. ?utm_example=1234&para=value). The red hits everything else (i.e. where the utm_ term(s) are only at the end of the URL). Green is when utm_ falls in between two other parameters.
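The three cases described above can be tried out with a minimal sketch in Python rather than AWB. The pattern is the longer of the two regexes in this request, copied unchanged; the name strip_utm and the sample URLs are hypothetical, and the colour labels in the comments are only a reading of the description, since the original highlighting is not visible here.

```python
import re

# The longer regex from this request, split at its two top-level "|" so each
# alternative can be labelled. The colour labels are an assumption based on
# the description in the request, not something the source states per branch.
UTM_PATTERN = re.compile(
    # utm_ run at the end of the URL, optionally including the leading "?"
    # (presumably "red")
    r"\??(?:&?utm_[^=\s]*?=[^&\s\]\|]*?)+(?=<|}|]|\s|\|)"
    # utm_ run directly after "?" and followed by another parameter
    # (presumably "blue")
    r"|(?<=\?)(?:&?utm_[^=\s]*?=[^&\s\]\|]*)+&"
    # utm_ run sitting between two other parameters (presumably "green")
    r"|(?<=&)(?:&?utm_[^=\s]*?=[^&\s\]\|]*)+&"
)


def strip_utm(wikitext: str) -> str:
    """Remove utm_* query parameters from URLs embedded in wikitext."""
    return UTM_PATTERN.sub("", wikitext)


if __name__ == "__main__":
    samples = [
        "[http://example.com/a?utm_source=x&utm_medium=y Title]",
        "[http://example.com/a?utm_source=x&id=5 Title]",
        "[http://example.com/a?id=5&utm_source=x&page=2 Title]",
        "|url=http://example.com/a?id=5&utm_campaign=z|title=T",
    ]
    for sample in samples:
        print(strip_utm(sample))
```

Note that the first alternative relies on a lookahead for a following delimiter (whitespace, ], |, < or }), so in this sketch a URL that sits at the very end of the input with nothing after it is left untouched.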
In addition to the UTM parameters, there's also "?cmpid", and probably others. DS (talk) 16:14, 1 June 2017 (UTC)
utm_ with cmpid in the regex. Primefac (talk) 18:37, 1 June 2017 (UTC)
?mbid parameter cleanup as well "speedily approved" in lieu of another task as this is low volume. xaosflux Talk 00:29, 7 August 2017 (UTC)