This is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current main page.
Trying to produce a list of blocks for users over 5,000 edits with a column indicating [effectively] whether the block was lifted, expired, or is still active, and a column for time between block and unblock [if applicable]. I can pull blocks and unblocks, but navigating durations and timing is more challenging. This is part of a larger project for which poking at the block log is just one component. — Rhododendrites talk 21:44, 21 August 2023 (UTC)
I should have time to work on this either tomorrow or (less likely) later today, but if you want to keep poking at it in the meantime, I've taken a glance at what you've tried so far. I wouldn't attempt to parse block_params; instead, look for an unblock in logging with a timestamp later than the latest block. You can use ipblocks_ipindex to see if a user (not just an ip, despite the table name) is currently blocked, which will let you tell if a block without a corresponding unblock has expired. —
Cryptic 22:02, 21 August 2023 (UTC)
I'll also have to come back to it tomorrow. My understanding of the various tables (and of SQL) is fairly limited, so I'd appreciate any help you have time for! — Rhododendrites talk 01:57, 22 August 2023 (UTC)
Is it important at all that it be done in a single query? I'm thinking it would be easier to post-process this. —
Cryptic 20:10, 22 August 2023 (UTC)
@Cryptic: Not necessarily, no. Certainly "column for time between block and unblock" and a conditional [block date + duration < current date] is something I can add afterwards. Where it'd be difficult for me (not knowing Python, etc.) is being able to take the full list of blocks and unblocks and automatically pair them up, where applicable, if that makes sense. Like this unblock goes with that block. — Rhododendrites talk 20:50, 22 August 2023 (UTC)
Well, that much is easy to do (slowly) - just sort first by the blocked username, then by timestamp; and don't limit log_action to 'block'. Might run into trouble with overlapping blocks, partial blocks, or the like, and modified blocks might be an issue, but I expect the overwhelming majority of blocks will have exactly 1 or 0 unblocks.
quarry:query/76016 is the simplest thing that could possibly work. I'll check in on it later to see whether it completes or times out. —
Cryptic 21:11, 22 August 2023 (UTC)
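In case it helps to see the pairing idea outside SQL: a minimal Python sketch (the row shape here is hypothetical, not the real logging-table schema) that sorts by blocked user and then timestamp, matching each block with the next unblock and leaving unmatched blocks for the expired/still-active check.

```python
def pair_blocks(log_rows):
    """Pair each block with the next unblock of the same user, if any.
    log_rows: (user, timestamp, action) tuples - a made-up shape, not
    the real logging-table schema."""
    rows = sorted(log_rows, key=lambda r: (r[0], r[1]))  # user, then time
    pairs, open_block = [], {}
    for user, ts, action in rows:
        if action == "block":
            if user in open_block:  # reblock before any unblock was logged
                pairs.append((user, open_block.pop(user), None))
            open_block[user] = ts
        elif action == "unblock" and user in open_block:
            pairs.append((user, open_block.pop(user), ts))
    # blocks with no unblock: either expired or still active
    pairs.extend((u, ts, None) for u, ts in open_block.items())
    return pairs
```

Blocks left with None in the third column are the ones to check against ipblocks_ipindex, per the advice above.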
Ha - it did while I was typing that out. Running it again so it sorts usernames-with-underscores and usernames-without consistently; it'll be back up in under five minutes. —
Cryptic 21:13, 22 August 2023 (UTC)
What do you want your final output to look like, say, for the first four users at that query ($1LENCE D00600D, $uperFan32, (CA)Giacobbe, and *drew)? They should be fairly representative. Do you still want columns for the blocking admin, block reason, and so on? — Cryptic 21:37, 22 August 2023 (UTC)
This is great, thanks! I'll fork it and poke at it over the next couple of days. Would you indulge me by saying in human terms what this is doing:
UNION SELECT '99999999999999', NULL, 'currently blocked', NULL, user_name, user_editcount, NULL, NULL FROM ipblocks_ipindex JOIN user ON user_id = ipb_user
UNION combines the output of two selects. The later ORDER BY has lower precedence, so it'll sort the combined output.

The second SELECT finds all rows in the ipblocks_ipindex table (currently-active blocks - Special:BlockList is the onwiki interface). JOINing user limits the selected rows to ones with a corresponding row in the user table, i.e. it eliminates autoblocks and blocks of IPs. It also lets us limit the output to users with 5000 or more edits in the WHERE clause.

UNION requires all of the select clauses it's combining to output the same number of columns, so I used NULL where there either wasn't data to match the first query (such as log_id) or where, even when there's an equivalent in ipblocks_ipindex, it would just repeat data from the first query (such as the blocker, actor_name in the first and ipb_by_actor in the second). user_name, on the other hand, is still included since we sort on that column later. user_editcount could've been nulled as redundant, too, I suppose, but we already had it from joining user, while we'd have to join the actor table to get a name for the blocking admin.

'99999999999999' is always output in the timestamp column so that it sorts after any real timestamp (whether sorting numerically or as a string), and 'currently blocked' was picked for clarity in the log_action column. Selecting a constant value like that instead of a table column name just outputs the constant in that column for every row.

I'm still willing to do the postprocessing if you let me know precisely which data you need. Getting the data is the easy part; dealing with all the different formats that the logging table has used over the years to indicate duration - if you indeed want the initial duration of modified or expired or still-active blocks - is what's actually hard here. —
Cryptic 19:40, 23 August 2023 (UTC)
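A tiny illustration of why the all-nines sentinel works (the rows below are invented stand-ins for the combined UNION output):

```python
# MediaWiki timestamps are 14-digit YYYYMMDDHHMMSS strings, so a value of
# all nines sorts after any real timestamp whether compared as a string or
# as a number - which is why the fake 'currently blocked' rows use it.
SENTINEL = "99999999999999"

# invented rows standing in for the combined UNION output
events = [
    ("20230822000000", "block"),
    (SENTINEL, "currently blocked"),
    ("20230821120000", "block"),
]
events.sort()  # plain string comparison, like the final ORDER BY
```

After sorting, the fake 'currently blocked' row always lands after the real log entries for that user.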
This is really helpful, Cryptic, thanks. OK, so the thing I'm really trying to do at the moment is find blocks for reasons related to attacks, harassment, civility, etc., then compare unblock percentages for various edit count ranges (e.g. <10, 10-99, 100-999, 1000-9999, and 10000+). I know there's a big limitation in that we don't store the edit count at the time of the block, granted, but that's the most important part. The "nice to have"s would be the mean/median starting block duration and the mean/median actual block duration (if unblocked), but I appreciate that's harder.
I think I may be able to work with the query you've put together here to get what I'll need. I'll have to think about the kinds of conditional statements to make it work. I started to try to adapt it here, though the results I'm seeing aren't what I expect ("currently blocked" lines without preceding block entries, and missing comment text when I've searched for comment text). Adding: Oops. Accidentally clicked submit again, after it took a while the first time (in case you click the query above shortly after I write this). — Rhododendrites talk 02:28, 26 August 2023 (UTC)
The reason you're seeing "currently blocked" lines without corresponding blocked lines is that, in the first select, you filter out block log entries that don't mention certain terms; the second select shows all currently-blocked users in the given edit count range. (Be aware you're also filtering out the unblocks that don't match those terms.)

There's a field in ipblocks (ipb_reason_id) that works like log_comment_id, so you could, conceivably, filter both halves of the query. Ultimately, though, I don't think that would work very well, since - besides missing the unblocks - for users that were reblocked, it'll be the comment of the last active block, so you'll get cases both where you see the currently-active block and the reblock, but not the original block, and vice-versa. And you're still not going to get the unblock lines - the block log entries for unblocks hardly ever quote the original block. This would only be a little better than what you have now.

What you're really interested in filtering by isn't the comment field of the unblock or reblock entry in the block log; it's the comment field of the original block. There's probably a more elegant way to do that than quarry:query/76110, but this way works. What it does is add a column, original_reason, that shows comment_text when log_action = 'block' (that is, an "Admin blocked Vandal for 3 weeks" log entry, not a reblock or unblock or one of the 'currently blocked' fake entries added from the ipblocks table); otherwise, whatever was in the previous row's original_reason column. (It also shows comment_text instead of the previous row's value if the blocked user's name isn't the same as in the previous row, as a failsafe.)

I did that query for 5000+-edit users instead of 1-100-edit users as in your query, as I wasn't interested in waiting half an hour for it to complete. I also filtered by regex instead of that list of LIKE clauses, which A) is easier, if you happen to already be familiar with regular expressions, and B) lets us make the comparison case-insensitive, so you'd see a block log entry of "Harassing [[User:Innocent]]", which LIKE '%harass%' would omit. —
Cryptic 04:29, 26 August 2023 (UTC)
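In procedural terms, the original_reason column described above is a forward-fill. A rough Python equivalent, with a made-up row shape and assuming rows arrive already sorted by user and timestamp:

```python
def fill_original_reason(rows):
    """Carry the comment of the original 'block' entry forward onto the
    following unblock/reblock rows for the same user, mimicking the
    original_reason column described above. Hypothetical row shape:
    (user, action, comment)."""
    out, prev_user, prev_reason = [], None, None
    for user, action, comment in rows:
        if action == "block" or user != prev_user:
            # a fresh block (or a new user) restarts the carried reason
            prev_reason = comment
        out.append((user, action, comment, prev_reason))
        prev_user = user
    return out
```

Filtering on the fourth column then keeps the unblock/reblock rows whose *original* block reason matched, which is the behavior the SQL version is after.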
@Cryptic: Queried, combined, and poking away now, and noticed something. Look at the query you link above and scroll to A3RO. Note an indef, never unblocked, not expired, and no "currently blocked" line. Do you know what would be causing that? — Rhododendrites talk 02:14, 29 August 2023 (UTC)
It's because, when I take the value for original_reason from the previous row instead of the current one, I'm taking that row's comment_text column instead of its original_reason column. And it's nontrivial to take the original_reason column instead. Crapola. I'll try to think of a different way to do it. Selecting all the rows in the query and then filtering by the block reason in post-processing would work for user_editcount >= 5000, but it won't for user_editcount BETWEEN 1 AND 100 - there's just too many results that would get thrown away in the middle to be practical. Probably won't be tonight. —
Cryptic 03:00, 29 August 2023 (UTC)
quarry:query/76110 updated. It no longer has the safety check against the previous line's log_title like before, but it seems to work. Running up hard against the limits of my experience here. —
Cryptic 04:05, 29 August 2023 (UTC)
I was wondering if it is possible to do a query like this, but with cumulative results for a whole Commons category of images (recursively for subcategories, if possible) rather than a single image. I was going to try to use the categorylinks table for this, but did not get very far yet. I am mainly interested in
Category:Images from the National Archives and Records Administration for now. Thanks!
Dominic·t 20:45, 13 September 2023 (UTC)
There are currently 467,285 files directly in that category, and 1,549,115 in the entire tree. Even if they average global usage on a single wiki each (compared to the 95 that the Messier 87 image has), Quarry can't handle that many results. I can write the query for you, but you're going to have to arrange some other way to run it and get the results, especially if I'm to take that "for now" at face value. —
Cryptic 21:40, 13 September 2023 (UTC)
@Cryptic: Thanks, I do appreciate that concern! If you have a good idea about how to do this for a hypothetical category, I can plug in smaller ones or play around with the query on my own, as well. (The "for now" was just meaning, if it's feasible, I could query other categories as well.) Dominic·t 22:28, 13 September 2023 (UTC)
quarry:query/76609 has results for the tree anchored at c:Category:Wormholes.

It's also going to be slow for files with a lot of usage. One of my tests was with
c:Category:High-resolution or SVG official Wikimedia logos, since the smallish subcats I picked out in the National Archives tree kept coming up with zero global usage; it turns out there's a separate row in globalimagelinks for every usage of every file on every wiki. Which, really, there has to be, but it hadn't sunk in that files like
c:File:Commons-logo.svg have a couple hundred thousand uses in the mainspaces of each of dozens of wikis. It ran for half an hour before I gave up. You're likely to get similar results when running it for categories with large numbers of files, even if they're not in widespread use, or any use at all. —
Cryptic 22:43, 13 September 2023 (UTC)
Fixing bare section links
I'm trying to use a regex query to help me do an AWB fixing instances of improperly formatted section links in see also sections
like this. I came up with the query insource:/\*\s?\[\[[\w\s]*#[\w\s]*\]\]/, which seems in testing to properly find the strings, but it's timing out when I try to use it. Any idea how I can simplify the query enough to get it to run? {{u|Sdkb}} talk 01:03, 21 September 2023 (UTC)
That's not a query, it's a search string. You're asking in the wrong place, though I'm not aware what the right place would be. To start with, you're using entirely the wrong variant of regexes; neither \w nor \s are character classes, and # is a metacharacter that needs to be escaped.
Help:Searching/Regex might be useful. —
Cryptic 01:17, 21 September 2023 (UTC)
Any search for just insource:/whatever/ is likely to time out, because it has to search every page in the selected namespace(s), usually several million. Search performs optimally (i.e. completes at all) with at least one "normal" search term, optionally limited further by insource:, e.g. Apple insource:/banana/.
Certes (talk) 11:18, 21 September 2023 (UTC)
@Cryptic @Certes, thanks both! I didn't realize that the search flavor of regex was so different from the AWB flavor, which is what I was using. I got the search query to work using insource:/\* ?\[\[[A-Za-z0-9 ]+\#[A-Za-z0-9 ]+\]\]/ and am replacing \*( ?)\[\[([A-Za-z0-9 ]+)\#([A-Za-z0-9 ]+)\]\] with *$1{{section link|$2|$3}}. Cheers, {{u|Sdkb}} talk 18:16, 21 September 2023 (UTC)
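For anyone following along, that find/replace pair behaves like this when transcribed into Python's re syntax (the sample string is invented; AWB itself uses .NET regexes, which agree with Python for this particular pattern):

```python
import re

# Sdkb's find pattern and {{section link}} replacement, transcribed into
# Python's re syntax ($1 backreferences become \1, etc.)
pattern = re.compile(r"\*( ?)\[\[([A-Za-z0-9 ]+)#([A-Za-z0-9 ]+)\]\]")
fixed = pattern.sub(r"*\1{{section link|\2|\3}}", "* [[Foo#History]]")
```

The same caveat raised below applies here: titles containing characters outside [A-Za-z0-9 ] are silently skipped.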
Yes, it is frustrating that the insource regex language is different from the ones used by AWB, AutoEd, etc.
Here's a variant search that also times out but that should keep you busy for a while. It does not search for all possible characters that can appear in titles – see
WP:TITLESPECIALCHARACTERS, which might help you construct a "any character but not these characters" regex – but it's a start. –
Jonesey95 (talk) 18:20, 21 September 2023 (UTC)
There are a number of especially frustrating things here.
The syntax looks like PCRE or POSIX regexes, unlike, say, the flavor used in Lua (which is also different, but very visibly so), so it misleads people into googling "how can I make a regex to do X" and thinking it'll work; worse, frequently it does get just enough results to look like it worked, even though the syntax differences mean that what was searched for wasn't what was meant to be searched for.
The underlying database engine natively supports PCRE - the de-facto standard variant that everyone expects - but we can't use it in queries because page text isn't copied to the replicas we have access to.
The combination of not implementing ^ or $, having no syntax that matches just a newline, and including newlines in . and complemented character classes makes it verbose and error-prone to express a match that has to occur all on the same line, or one that starts at the beginning of a line - and this particular search really needs to do both.
/^\*.*\[\[[^]|]*#/m would do what you want, if we could use a sane regex engine; it would give bad output for unclosed wikilinks, but those are unlikely to survive long in articlespace anyway. Here, the closest equivalent, I think, is /[^ -]\*[ -]*\[\[[ -\\^-{}-]*\#/, which has the additional drawbacks of not matching at the very start of a page (which won't matter) and treating tabs like they're newlines (which might). —
Cryptic 20:10, 21 September 2023 (UTC)
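For comparison, here is the PCRE-style pattern from the comment above running under Python's re module, which is close enough to PCRE for this construct (the sample wikitext is invented):

```python
import re

# The PCRE-style pattern described above: a bulleted line whose wikilink
# target contains a section. re.MULTILINE makes ^ match at line starts.
section_link_item = re.compile(r"^\*.*\[\[[^]|]*#", re.MULTILINE)

# invented sample wikitext
page = "Intro text\n* [[Foo#History]]\n* [[Bar]]\n"
```

It matches the first list item but not the plain [[Bar]] link; as noted above, an unclosed wikilink earlier on the line would still confuse it.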
Well done. That sounds better, though titles and anchors can contain characters other than alphanumerics and spaces. If I need a one-off search which times out, I follow it with prefix:A then repeat with prefix:B etc., though not all titles are alphabetic and it might be antisocial of me to do that too often.
Certes (talk) 18:23, 21 September 2023 (UTC)
@Sdkb: Taking a step back, how necessary is this task? I don't see any guideline or even essay that prefers {{section link}} to [[Foo#bar]]. As a section symbol, § may be more familiar than # to some readers (though not me personally). However, it might make certain searches harder. For example, I occasionally trawl for links to [[C#something]], which link to sections about letter C but were usually meant for
C♯ (musical note) or
C Sharp (programming language). I never thought to check whether any have been converted to section links.
Certes (talk) 20:01, 21 September 2023 (UTC)
Yeah, my understanding from that section (and the example of general practice at refined pages) is that § is the preferred form for display to readers (and better display for readers always takes precedence over easier searching for editors). {{u|Sdkb}} talk 21:11, 21 September 2023 (UTC)
Fair enough then (though that section also explicitly recommends using a link and doesn't mention the section link template). We should also consider linking to the section via a redirect, in case the text moves into a stand-alone article or elsewhere, but that's obviously not a task that can be automated.
Certes (talk) 22:40, 21 September 2023 (UTC)
Yeah, what I noticed during the run (about 600 pages) was lots of questionable entries in see also sections (including some stuff that should be navboxes, some stuff that should be links within the article, and some stuff that just should not have been linked). The sample of pages that showed up in the query is probably lower-quality on average than a random sample of see also pages as a whole would be. Still, there's definitely lots of cleanup to do in the area. Doing a systematic review/cleanup is way beyond the scope of what I want to get into, though.
There were also occasional instances where the query didn't pick up all the section links in a see also section (e.g.
here), so a smarter query could potentially pick those up. And there were also a few erroneous entries where someone tried to redlink a title that includes a hashtag, not realizing that such a title cannot exist due to technical restrictions (which I skipped or, if I saved before catching the error, reverted, as
here). {{u|Sdkb}} talk 23:58, 21 September 2023 (UTC)
I suspect your "from" AWB regex begins with something other than "\[", meaning that it can include part of any previous adjacent link and prevent that from being amended if it's also to a section. So if it's "(.*)\[\[whatever" → "$1{{section link|whatever" then, as well as being inefficient, the initial .* will grab any previous section link.
Certes (talk) 10:44, 22 September 2023 (UTC)
Number of articles under protection
Hi. I'm looking into historical trends in the number of articles under each type of protection. Could anyone please point me to or create a query where for a given date, it says "On this date, x articles were under semi-protection, y articles were under full protection, and z articles were under extended-confirmed protection"?
Note that this is not the number of articles for which protection was *applied* on the date; it's the number of articles for which protection was in effect. Thanks in advance and cheers,
Clayoquot (talk | contribs) 17:12, 4 October 2023 (UTC)
This is prohibitively difficult to extract from the live database - we'd have to parse not only the protection logs, but deletion and move as well, support all the data formats that have ever been stored in all three of them (which has always been for the convenience of php-based mediawiki, not sql datamining; getting durations is a particular mess), and trace every page that's been protected through moves until either the protection expires, the protection changes, or the page is deleted. That's difficult enough when you're doing it manually online for just one page.

It's more feasible to pull it from the dumps, though, particularly if you don't need page namespaces or titles. You'd just need the page_restrictions file for that - it's always been relatively small, and parsing it is simple enough that you wouldn't even need to install a sql engine. You'll be limited to what dates you can get dumps with that file for, of course.

If you do need articles in a more strict sense of the word (namespace 0, not a redirect) you're going to need the page file too; that's about 2 gigabytes compressed in the latest dump, and while still possible to parse and match up with the page_restrictions table without a real sql engine, it's a lot less trivial. (If you insist on "articles" in the even stricter sense that {{NUMBEROFARTICLES}} uses - namespace 0, not a redirect, has at least one link - you need the pagelinks file too, and definitely a sql engine, and I daresay nobody's going to lift a finger to help you.) —
Cryptic 01:14, 5 October 2023 (UTC)
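As a sketch of the "no sql engine needed" approach: the page_restrictions dump is a file of INSERT statements, and a few lines of Python can tally edit-protection levels from it. The tuple regex below is a simplification that assumes values contain no embedded parentheses, and the column order shown (pr_page, pr_type, pr_level, ...) is an assumption to verify against the dump you actually download.

```python
import re
from collections import Counter

# Match the leading columns of each (...) value tuple in an INSERT line.
# Assumed column order: (pr_page, pr_type, pr_level, ...); verify against
# the real dump before trusting the counts.
ROW = re.compile(r"\((\d+),'([^']*)','([^']*)',")

def count_levels(dump_lines):
    """Count pages per edit-protection level in a page_restrictions dump."""
    counts = Counter()
    for line in dump_lines:
        for _page_id, pr_type, pr_level in ROW.findall(line):
            if pr_type == "edit":
                counts[pr_level] += 1
    return counts
```

Running this over each dated dump gives the "in effect on this date" counts the question asks for, without reconstructing anything from the logs.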
Thank you Cryptic for this thorough explanation! Your detailed answer is very much appreciated. Clearly this is above and beyond the call of duty for volunteers. It might be a good topic for academic research - perhaps
Benjamin Mako Hill would know someone interested in the challenge. Cheers,
Clayoquot (talk | contribs) 20:55, 5 October 2023 (UTC)
20:55, 5 October 2023 (UTC)
Don't forget that disambiguation pages are generally not regarded as articles, despite being non-redirects in namespace 0. However, it's rare that they need protection.
Certes (talk) 21:06, 5 October 2023 (UTC)
As it happens, Mako wrote
a paper about this very topic in 2015 together with Aaron Shaw, and they also published a dataset and code to recreate it:
https://communitydata.cc/wiki-protection/ (link redirects to a current one that for some reason triggers the spam blacklist.) No guarantees that this code still works, of course. Regards,
HaeB (talk) 04:40, 12 October 2023 (UTC)
external links -- el_to
This query used to work; now it does not. It would print output like:
# adapted from https://quarry.wmcloud.org/query/77092
USE enwiki_p;
SELECT page_title,
page_namespace,
el_to_domain_index,
el_to_path
FROM externallinks
JOIN page ON page_id = el_from
WHERE el_to_domain_index LIKE 'https://nl.kb.%' OR el_to_domain_index LIKE 'http://nl.kb.%';
Well, not so much to take the load off the sql server, but to make it remotely elegant. "Split this string into parts by this character, reverse them, and join them back together" can't really be expressed succinctly in sql; the least ugly way I know how to do it is
like this. —
Cryptic 05:12, 12 October 2023 (UTC)
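Going the other direction - turning a stored el_to_domain_index value back into a readable host - is trivial outside SQL, which is part of why post-processing is attractive here. A sketch (the example values are invented):

```python
def domain_index_to_url_prefix(index):
    """Undo the reversed-domain storage format of el_to_domain_index:
    'https://nl.kb.www.' -> 'https://www.kb.nl'. The trailing dot and
    reversed label order are how the externallinks table stores domains
    so that prefix (LIKE) searches work."""
    proto, _, rest = index.partition("://")
    parts = rest.rstrip(".").split(".")
    return proto + "://" + ".".join(reversed(parts))
```

This is the "split by this character, reverse, and join" operation that is so awkward to express succinctly in SQL.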
Fantastic, thank you very much. Nice regex statement, 8-level deep domain seems reasonable which allows 5 or 6 hostnames eg. uk.co.bus.host1.host2.host3.host4.host5 . I'll keep an eye out for exceptions. Yes I didn't want to push SQL to do complicated string processing with many statements and temporary variables. Good to know regex is available. The execution time is only about 10% slower with the regex, and I won't be doing this query often. --
GreenC 18:43, 12 October 2023 (UTC)
There's plenty of external links that pattern won't match, from [http://www.ncbi.nlm.nih.gov.ezproxy.webfeat.lib.ed.ac.uk/pubmed/6804402] on
User:Myxoma to [http://cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.cache.click.down2crazy.com/] on
Wikipedia:Reference_desk/Archives/Computing/2010 September 7 to [news:comp.lang.ada] on
Talk:Type safety. It wasn't meant to; I tailored it to your existing query, which already requires starting with http(s)://nl.kb., and there aren't any external links that match that and have more than five parts. Matching eight gives some headroom without making the backtracking too ridiculous or needing to resort to the obscure \g{} syntax. I wouldn't infer anything from the relative execution times, either; when the difference is only a second or two, that's going to be dominated by caching, server load, and - particularly when the resultset is large - network transfer of data from the sql replicas to Quarry and inserting the results into Quarry's own database. —
Cryptic 20:19, 12 October 2023 (UTC)
Hmm.. well, maybe I'd be better off parsing with my script for unlimited levels, and code for exceptions like
WP:PRURL and other short-hand forms that come up. URLs can get pretty messy. Too bad as I like your solution. --
GreenC 21:06, 12 October 2023 (UTC)
Find non-redirects with "(Kanye West song)" in title
Hey there! I'm looking to query pages beginning with Wikipedia:Peer review/ for the users with either the highest number of edits or the highest number of characters added (or both) over the last six months. It would be useful if the query also reported the pages in question for each user. I'd appreciate any help, as I have next to no knowledge of SQL. Thanks! —
TechnoSquirrel69 (sigh) 02:52, 17 October 2023 (UTC)
@Jo-Jo Eumerus: quarry:query/77451. There are about 32,000 file talk pages that do not transclude any templates, so I also filtered by pages that have no wikilinks on them (which includes filtering out pages with user page/talk page links from signatures). If you'd like, I can add in pages that contain autosigned comments ("preceding unsigned comment by...") but when I did this I was seeing a lot of false positives.
Anyway, there seems to be plenty of vandalism/test and
WP:NOTFORUM edits in the result set, with some good faith mixed in as well. FYI, you may need to scroll the results to the right on Quarry to see the additional columns: last revision and page size. The results are sorted by the talk pages with the most recent revision appearing first by default.
Uhai (talk) 16:04, 21 October 2023 (UTC)
Thanks! Aye, there is a lot of vandalism in this namespace that goes undetected - but also a fair amount of germane (if often misplaced) commentary. Thanks for the heads-up on scrolling, too.
Jo-Jo Eumerus (talk) 16:32, 21 October 2023 (UTC)
Village pump participation
I'd like to know the total number of registered editors who have ever posted to any of the village pumps. I'd like this to be de-duplicated, so that me posting on WP:VPM and VPIL and VPT, etc. only counts as one editor. Is this possible?
WhatamIdoing (talk) 02:15, 23 October 2023 (UTC)
Precision is not necessary. With either set of numbers, I'm looking at only 1 in every ~550 editors (counting those who registered an account and actually made at least one edit) having ever posted to any village pump.
Yes, but it's going to be a lot slower. There are 156 non-archive Village pump pages in WP: and WT: space, and about 550,000 pages starting with 'WP:Articles for deletion' or 'WP:Votes for deletion'. The village pumps have a lot more revisions - the query's going to scale by that number as well as the number of pages - but not nearly enough to make up for it. I'll start it running on Toolforge where it's less likely to time out. — Cryptic 04:37, 23 October 2023 (UTC)
I'm looking at only 1 in every ~550 editors (counting those who registered an account and actually made at least one edit) having ever posted to any village pump. I'd suggest that isn't a very useful ratio; 95% of accounts have fewer than ten edits, and the fact that those editors haven't participated on certain pages isn't particularly informative.
So ~60% of the people posting are extended confirmed and 40% aren't. A cutoff that excludes almost half the participants doesn't make me feel good.
WhatamIdoing (talk) 05:20, 23 October 2023 (UTC)
The cutoff certainly shouldn't be 1 edit, either, though. There's no reason to think someone would find their way to the project namespace (let alone the Village pump in particular) on their first edit, and it's so implausible that it tends to generate suspicion when it does happen. —
Cryptic 05:26, 23 October 2023 (UTC)
Looking through
those editors, it seems that most of them are lost - usually looking for the Teahouse or help desk, sometimes just lost in general. The rest just seem to have something they want to say about/in relation to Wikipedia and aren't interested in contributing further; I suspect the actually suspicious ones are the ones with more than one edit but still a low number.
BilledMammal (talk) 06:06, 23 October 2023 (UTC)
This doesn't surprise me. Sometimes it's a request to change an article ("He died. Please update the article"). We used to get complaints about fundraising from new accounts, though that seems less common now (I wrote
User:WhatamIdoing/Fundraising years ago [2011?] in response to one of them).
WhatamIdoing (talk) 15:38, 23 October 2023 (UTC)
Out of curiosity, I pulled the figures for just 2023:
The ratios are 4, 4, 3.5, and 3. The Community™, no matter what cutoff you use, is more likely to be found (in aggregate) at AFD than at the Village pumps (in aggregate). This is a bit complicated, then, because it's easier to discern The True™ Will of The Community™ in an abstract discussion than by looking at messy individual cases. However, the individual actions (a type of revealed preference) are likely to be more accurate than the stated preferences, and there are many more individuals involved in those individual actions. Obviously, the preferences of people (like me) who self-select for spending a lot of time telling others (like the participants in AFD) what to do are not always going to be the same as the preferences of the people who are doing the work. That's one of the reasons that RFA has long sought evidence that the candidate has at least created a decent article.
Also, I note that the all-time/all-editors ratio is 6, suggesting that either the Village pumps have gotten more popular over time (=my guess) or that AFD has gotten less popular (or a combination of both).
Thank you both for the fascinating numbers. I'd be happy to hear about anything similar at any time (even years from now). Let me see if there's a page where I can record some of these stats for anyone else who might be interested in the future.
WhatamIdoing (talk) 15:32, 23 October 2023 (UTC)
Mainspace pages with one editor (sorted by createdatetime)
I'd like to see a list of the mainspace pages with only one editor, preferably including the creation date (looking for the oldest). (In this case, bots would count among editors.)
Naraht (talk) 14:08, 17 November 2023 (UTC)
A full list isn't practical; it would have to look at the majority of the 1.2-billion-row revision table. Finding even the oldest isn't really possible, either, since we've lost most of the history from 2001. On the other hand, just about all the pages that old have also been edited by
User:Conversion script, so they've got at least two editors. Oldest page with only a single known editor is Pythagorean Theorm (
history), created on 2002 Jan 29 and edited only by
193.203.83.xxx. —
Cryptic 16:47, 17 November 2023 (UTC)
And somehow the only edit after creation changed nothing... Thanx, I see now querying the revision table would be ugly.
Naraht (talk) 17:04, 17 November 2023 (UTC)
High traffic talk pages
Apologies if this is already documented in a report somewhere (
this is not what I'm looking for), but is there a way to see how many pages in the Talk: namespace average at least one edit per day over some time period?
Thinking about how I'd structure this query in SQL immediately brings up problems of timeframes where every talk page that's ever been edited qualifies for a one-day moving window, and almost every talk page fails when the window is "days since talk page creation".
Are you familiar with SQL? If so then you could do something like checking for 30 edits in 30 days by ordering the revisions by page and time and using LAG to check that the row 30 back was for the same page as the current row and less than 30 days previous. (PARTITION BY page ORDER BY time might work faster, but I have never tested that feature.)
Certes (talk) 13:06, 25 October 2023 (UTC)
I'd think it'd work almost as well to look for talk pages by length (and no auto-archive template). That's trivial. —
Cryptic 13:58, 25 October 2023 (UTC)
@Folly Mox, I wonder if you would be more interested in talk pages, without archiving, that exceed some reasonable page size (e.g., 100,000 bytes). A brief burst of activity isn't necessarily a problem, so long as the page size doesn't get out of control.
WhatamIdoing (talk) 22:18, 28 October 2023 (UTC)
@I Am Andumé: quarry:query/78304. There's quite a large number of articles meeting these criteria, so I limited each result set to 1000 records. The first result set consists of articles that are not tagged as stubs but whose talk pages are members of at least one stub-class category. The second result set consists of articles that are tagged as stubs but whose talk pages are not members of any stub-class category.
The former totals 1,074,275 articles while the latter has 178,979. Note that articles with no talk pages aren't included in the results. The size of the latter result set could be reduced if you included start-class categories as well, but I don't know if this would be what you are going for.
Uhai (
talk)
09:36, 25 November 2023 (UTC)
Here's a list of the pages in other namespaces, if anyone wants to knock themselves out. Probably wouldn't be a bad idea for a bot task to fix these, especially considering these are only from the stub-class category and I'm sure the other content assessment categories make plenty of appearances in the wrong namespaces as well. There's also redirect talk pages like
Draft talk:2020 in Australia that should either also redirect or have the content assessment removed at the very least. I'm not actually sure what the correct thing to do here is.
Uhai (
talk)
11:59, 25 November 2023 (UTC)
Diacritics are also fairly common and unavoidable in non-English terms. Any idea how to ignore those as well? I'm looking specifically for chemical and mathematical formulas. Thanks!
fgnievinski (
talk)
04:37, 12 December 2023 (UTC)
That's more difficult, but I can put in some more time to get at what you're looking for. If anyone else wants to take a shot at it, that's welcome too.
Uhai (
talk)
04:43, 12 December 2023 (UTC)
@
Fgnievinski Okay, this really sucked, but I created a query using many of the character ranges at
Mathematical operators and symbols in Unicode:
quarry:query/78642. There are likely to be false positives and false negatives, but this should hopefully be a good enough start. Note that some apparent false positives like
Michael Ende actually aren't because the character between the birth and death dates is a Unicode minus sign rather than a dash or en dash. Let me know what issues you find and I can attempt to continue to refine.
Uhai (
talk)
07:07, 12 December 2023 (UTC)
Something like
quarry:query/78652 sucks less to do, at least to exclude diacritics rather than look specifically for math symbols. (Nonbreaking spaces aren't as frequent as en-dashes, but there's still nearly 30000 of them and they make it a lot harder to figure out what's going wrong when they show up in the results.) —
Cryptic16:01, 12 December 2023 (UTC)
These results have a lot fewer false positives and nicely identify easily fixed quote marks and dashes. It would be great to have something like this as a daily or weekly report, maybe with a separate daily report for nbsp characters. I suspect that we have a few AWB editors who could make passes through the results and clean up the unambiguous problems pretty quickly. –
Jonesey95 (
talk)
16:58, 12 December 2023 (UTC)
quarry:query/78658 has just shortdescs containing nonbreaking spaces and thin spaces, which I think everyone can agree are incorrect. Are other forbidden unicode characters documented anywhere in any more detail than
WP:SDFORMAT's "plain text – without HTML tags, wiki markup, typographic dingbats, emojis, or CSS display manipulation"? That doesn't seem to even mildly discourage the "A number 𝑥, 𝑦, or 𝑧 that can solve the Markov equation 𝑥² + 𝑦² + 𝑧² = 3𝑥𝑦𝑧" example from the talk page or even the unusual spaces here. Certainly no more than it does en-dashes. —
Cryptic17:54, 12 December 2023 (UTC)
I think this discussion can be wrapped up. Thanks to all for these great queries and other tips. I learned a lot today. If you are interested in continuing the discussion, see
Wikipedia talk:Short description, where I link to two database reports that I have created. –
Jonesey95 (
talk)
21:38, 12 December 2023 (UTC)
Will do, thanks! Would it be easy to select only the deletions of articles older than one year? Most deletions should concern recent article creations, I suspect.
fgnievinski (
talk)
06:03, 3 December 2023 (UTC)
Possible. Depending on how accurate you want it to be, between not easy ("is there any deleted revision dating from a year before the deletion") to hard ("is there any deleted revision dating from a year before the deletion, but after any previous deletions") and very hard ("was there a revision dating from a year before the deletion, after any previous deletions, that might currently not be deleted anymore, or might have been moved to a different title since then). Either way, it'll be very, very slow. —
Cryptic06:07, 3 December 2023 (UTC)
I was hoping the page creation timestamp could be stored in the pages table or in the page properties table. Looking at the revisions table, page creations could be identified with rev_parent_id = 0. Would that speed things up, when trying to calculate the page age at death?
fgnievinski (
talk)
06:32, 3 December 2023 (UTC)
They're not in revision, page, or page_props anymore once they've been deleted, though there's a similar ar_parent_id column in archive. Either way, (rev|ar)_parent_id=0 isn't necessarily the first revision - a non-deleted revision could have a *_parent_id=0 without it being first if it's been history merged, for example - and there might not be a visible *_parent_id=0 at all if the history was split or if the creating edit was revdeleted or suppressed. And it wouldn't be appreciably faster than looking at the timestamp. (The creation log is no help, either, since it's only existed since mid-2018, and it doesn't account for moves between creation and deletion.) —
Cryptic06:41, 3 December 2023 (UTC)
I recently came across
quarry:query/78759, which achieves something related to my original goal, of "separating the wheat from the chaff" or splitting the number of deletions into probably uncontroversial ones and the rest. They've looked at the time difference between user registration and article creation and concluded half of the pages deleted had been created by new users. So now comes the ask: would it be possible to adapt
quarry:query/78459 for counting only the deletions of pages created by users registered more than a week before the page creation -- or, better yet, only by autoconfirmed users (with such status at the time of page creation)? Thanks!
fgnievinski (
talk)
05:00, 17 December 2023 (UTC)
Thank you Cryptic! Is there a way to filter this query to show the monthly counts for deleted articles rather than all types of pages? For this purpose
"article" can be defined as namespace=0 and excluding redirects. I don't mind having disambiguation pages and very short or empty articles included.
Clayoquot (
talk |
contribs)
23:10, 13 December 2023 (UTC)
Namespace 0 yes - add AND log_namespace = 0 to the WHERE clause, if you're comfortable forking the queries yourself - but whether a page was a redirect or not is lost when it's deleted, short of parsing the deleted page's text. Quarry has neither permissions to get at the full information of deleted pages, nor page text even if it hasn't been deleted. It does have access to the length in bytes of each deleted revision, but syncing these log entries up to the correct latest deleted revision is difficult, per my 06:07 3 December comment above. —
Cryptic23:21, 13 December 2023 (UTC)
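For reference, a standalone sketch of a monthly mainspace-deletion count with the suggested log_namespace filter applied (the linked queries are forks; this is just the general shape):

```sql
-- Monthly count of mainspace deletions from the logging table.
SELECT LEFT(log_timestamp, 6) AS month,   -- YYYYMM
       COUNT(*)               AS deletions
FROM logging
WHERE log_type      = 'delete'
  AND log_action    = 'delete'
  AND log_namespace = 0                    -- mainspace only
GROUP BY month
ORDER BY month;
```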
Ah, of course. Thank you! I made a fork here for "Mainspace deletions by month":
https://quarry.wmcloud.org/query/78694# . For April 2021 it gives an average of 625 deletions per day, whereas
wmcharts for a similar period gives an average of around 250 deletions per day. This seems like a bigger discrepancy than would arise from variations in the definition of "article". Do you have any idea what's going on?
Clayoquot (
talk |
contribs)
00:03, 14 December 2023 (UTC)
Doh, I missed a zero in my calculations. Thanks! The correctly-corrected number for April 2021 is ((18737 - 13157)/30) or 186 deletions per day.
Clayoquot (
talk |
contribs)
17:50, 14 December 2023 (UTC)
You've found some great candidates there. I don't see any bots; ZéroBot has gone now. Perhaps it didn't write enough GAs to pass RfA.
Certes (
talk)
10:00, 19 December 2023 (UTC)
Hehehe. Yes, if you remove all the conditions below AND LOWER(user_name) NOT LIKE '%bot', it will come back though. I am clearly misunderstanding something. –
Novem Linguae (
talk)
10:02, 19 December 2023 (UTC)
Per
wikitech:Help:MySQL queries#Alternative Views, I was about to suggest selecting from recentchanges_userindex rather than recentchanges. However, I tried that and it didn't help performance. LOWER is a pig to use because it fails silently on BINARY strings; the workaround is to convert them first. In fact, once you do that, LIKE works as documented and you don't even need LOWER: the check can be AND CONVERT(user_name USING latin1) NOT LIKE '%bot'.
Certes (
talk)
10:56, 19 December 2023 (UTC)
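In context, the workaround looks like this (user_name and the user table are the real replica schema; the query is otherwise illustrative):

```sql
-- user_name is a varbinary column, so LOWER() silently does nothing to it.
-- Converting to a character set first makes LIKE case-insensitive as
-- documented, and the LOWER() call can be dropped entirely.
SELECT user_name
FROM user
WHERE CONVERT(user_name USING latin1) NOT LIKE '%bot'
LIMIT 10;
```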
Generally you want to avoid correlated subqueries (subqueries that match based on fields in the parent query). This is because each subquery needs to run for each row in the parent query. In other words,
it's very slow. If you are going to use EXISTS with a correlated subquery, then avoid the asterisk and SELECT the table's
surrogate key instead, e.g. recentchanges' rc_id. In fact, for performance reasons, it's good practice to avoid the asterisk for all queries once you are past the stage of noodling around.
Other issues are that you duplicated the user group subqueries when you could have done AND ug_group IN ('sysop', 'bot') and that in the recentchanges subquery you JOINed the actor table when you could have done this in the parent query to improve performance. The latter, again, ties back to not using correlated subqueries to begin with.
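Putting that advice together, a sketch of the recommended shape (table and group names are from this thread; the overall query is illustrative, not the original):

```sql
-- Join actor once in the parent query, use a single IN list for both
-- groups, and SELECT a key column rather than * inside EXISTS.
SELECT rc_id, actor_name
FROM recentchanges_userindex
JOIN actor ON actor_id = rc_actor           -- joined once, in the parent query
WHERE NOT EXISTS (
        SELECT ug_user                      -- key column, not *
        FROM user_groups
        WHERE ug_user = actor_user
          AND ug_group IN ('sysop', 'bot')  -- one subquery for both groups
      )
LIMIT 100;
```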
It's linked above, twice, at
quarry:query/78816 and
quarry:query/78839. This sort of thing comes up every few years - I'm put in mind of
Wikipedia talk:Requests for adminship/Archive 246#Poll candidate search, for example. One of the problems is that, just as statistics are a really, really poor way to evaluate a candidate at RFA, using these sorts of statistics to find candidates for candidacy at RFA isn't such a great idea. Get accidentally blocked instead of the user you've listed at AIV and then immediately unblocked? Too bad, you're off the list. 9500 of your >10000 edits were automated and performed in the last month, with most of the rest deleted spam from the month before? That's ok, you're still on it, long as you actually picked out your username two years ago. Have you written great content, maybe even have a few featured articles under your belt, but the only time you bothered with a Good Article evaluation first you got a reviewer who put you off for months and then tried to make you jump through meaningless hoops? Too bad. But if all your blocks and failed rfas were before your namechange, you're still in the running. And it goes on and on. —
Cryptic21:07, 19 December 2023 (UTC)
The other methods of finding RFA candidates are also imperfect, and involve things such as "ooh this person seems to have the right personality, let me click on their userpage and dig around more", missing thousands of other good candidates. Or folks posting at
WP:ORCP, missing all the folks who don't post there. Or asking admin friends for ideas of folks to nominate, missing folks the friend doesn't mention. No technique will be perfect. The combination of all these imperfect techniques by multiple nominators is likely to find most of the good candidates though. –
Novem Linguae (
talk)
21:16, 19 December 2023 (UTC)
Stub types not listed on the Great List of Stub Types
I'd like a list of all stub templates (templates whose name ends with "-stub") and all stub categories (categories whose name ends with "stubs") that are not listed on WikiProject Stub sorting's
List of stub types. Thanks,
Andumé (
talk)
20:25, 21 December 2023 (UTC)
Never mind, I think I found all of them. That page is miserable to work with (but I suppose that's to be expected; the whole project is redundant makework anyway).
quarry:query/78899. —
Cryptic22:28, 21 December 2023 (UTC)
List of editors for a set of articles
Hello, favorite people who have still not told me to learn SQL myself so I'll quit bugging you (or, more realistically, so I'll have more complicated questions for you),
The main use case at the moment is to find people who are making multiple edits to articles within a group's scope, and then to invite those people to join the group.
WhatamIdoing (
talk)
21:14, 13 January 2024 (UTC)
Define "WikiProject-area pages and discussions" and "an article in the WikiProject's subject area"? I guess the latter is probably mainspace pages whose talk is in
Category:WikiProject Video games articles. If Reports bot's source is available, it's not immediately obvious where it is. —
Cryptic22:22, 13 January 2024 (UTC)
What I'd like is to find editors who edit articles whose talk pages are in the subcats of
Category:WikiProject Medicine articles. For example, could we make a list of editors who have made 10 edits to WPMED-tagged articles in the last 30 days?
For my own purposes, I'd also like to filter the list to remove editors who are blocked (at least if they're indeffed), who probably know about WikiProjects (e.g., those who have already made 10,000 edits) and those who mostly edit other areas (e.g., someone on an AWB run that happened to include some WPMED-tagged articles. WPMED tags a bit less than 1% of articles, so if WPMED-tagged articles are less than 10% of your edits, then you might not be a good candidate for an invitation to WPMED).
"Subcats" is almost always tricky, both in defining what that means - direct subcats, or any in the category tree? How deep in the tree? Subcats only, or also directly in the category (not relevant here)? - and in the query. In this case we've got
Category:All WikiProject Medicine articles, which should be everything. Editcount's easy for users, impractically difficult for ips. Directly-blocked users and ips is easy, but excluding ips that are currently rangeblocked is more trouble than it's worth. Mostly WPMED-tagged, maybe; needs thought.
quarry:query/79597 is most of it. —
Cryptic23:08, 13 January 2024 (UTC)
The "All" cat appears to have an accurate number of articles estimated, so let's go with that.
We can simply exclude the IPs, though perhaps that will get easier when
m:Temporary accounts are rolled out (still estimated for later this year, AFAIK).
Could the "mostly WPMED-tagged" be estimated as the percentage of edits we're counting, divided by the total of all mainspace edits during the time period? This would require counting how many mainspace edits User:Promising made overall last month, in addition to how many to WPMED-tagged articles specifically.
WhatamIdoing (
talk)
23:14, 13 January 2024 (UTC)
quarry:query/79597 again. IPs still there, rangeblocks still not. I expect everything involving ips to become unreasonably difficult once Temporary accounts happens. It certainly won't be easier. —
Cryptic23:24, 13 January 2024 (UTC)
For my purposes related to maintaining a WikiProject, I'd like this to look at any subject page that's in the "All ... articles" category of related talk pages, mainspace or not. I want to know about the work being done on any page under our project's inclusion umbrella.
Stefen Towers among the rest!Gab •
Gruntwerk23:29, 13 January 2024 (UTC)
That's not queryable. It's not reliably searchable either, but you can come close -
this search will find pages transcluding that template where "constituency_no" doesn't appear anywhere in the page source, and
this one will find pages transcluding that template where it does appear, but doesn't have any value other than perhaps spaces or control characters (like tabs and newlines). The problem with both is that they can't guarantee that "constituency_no" doesn't appear elsewhere on the page, perhaps as a parameter of another template, or where this template is used more than once. —
Cryptic18:57, 20 January 2024 (UTC)
It's really superb, but there is no option to export CSV/Excel. I have all the data for Indian constituencies. If there's a chance to add an export option, I can pull the relevant data. -
IJohnKennady (
talk)
09:43, 21 January 2024 (UTC)
Querying Wikipedia articles
I would like to query for Wikidata items that are linked to multiple language editions of Wikipedia, e.g. has both Yiddish Wikipedia and English Wikipedia sitelinks.
Another query, for all Yiddish Wikipedia articles that have a link to any other language edition except English Wikipedia. ~ 🦝
Shushugah (he/him •
talk)
22:43, 21 January 2024 (UTC)
Certes, thanks. And yes, I am aware of RandomInCategory, unfortunately it's missing an api. (And I did use RAND() instead of page_random intentionally, but nevermind.) —
Qwerfjkltalk21:31, 31 January 2024 (UTC)
Pages in category that haven't been edited by specific users
Related to
Wikipedia talk:Student assignments#Stats and in the hope that the information will be generally interesting, I'd like to know the number of registered accounts that were created during 2023 (any time during the year) and made an edit (any page, any namespace, deleted or undeleted, any time during 2023), and then to have that number divided according to whether the account has ever been blocked (any time since account creation, including in January 2024). I'd prefer to only count "full" blocks, not partial ones, but if that's too difficult, that's okay. If an account gets blocked a dozen times, I'd prefer to have that counted as "one blocked account" instead of "a dozen blocks were issued".
My goal is to be able to write a sentence like "About 1% of new accounts get blocked" or "Blocks are fairly common, while sitebans are rare. 12,345 new accounts were blocked in 2023, but only about a handful of people were sitebanned during the year".
WhatamIdoing (
talk)
22:11, 2 February 2024 (UTC)
The new set says 48,682/478,209=10.1% chance of getting blocked. So if you merely create an account and never edit, you have a (73,528-48,682)/(1,968,140-478,209)=1.66% chance of getting blocked (presumably some Checkuser blocks in that group), and if you make at least one edit, you have a 10% chance of getting blocked.
Maybe I didn't run the numbers correctly, but my calculation suggests that the risk of a block during 2023 was actually a bit higher for accounts created before 2023.
WhatamIdoing (
talk)
02:20, 3 February 2024 (UTC)
Pulling the list of accounts created from the newusers log, as I initially did and you still do in that fork, isn't going to work right unless you're filtering by time, since it didn't exist before September 2005 and hasn't been backfilled. —
Cryptic03:23, 3 February 2024 (UTC)
Which log should I use instead? (I have pretty good cutting and pasting skills, but...) Did I correctly filter for blocks that were placed during 2023, not accounts that were indeffed years ago?
WhatamIdoing (
talk)
04:36, 3 February 2024 (UTC)
You'd use the user table, if you also cared about when they registered or how many edits they had. But it's huge, and doesn't mix well with the also-huge logging table because it stores spaces in usernames as spaces instead of munging them into underscores, and of course the block logs don't store the user_id (or perhaps directly whether it's a block of an ip or not) like it would in a sane world. So it turns out the only reasonably-efficient way to do this is to look at each block to see if it parses as an ip or an ip range, as in
quarry:query/80167. —
Cryptic07:44, 3 February 2024 (UTC)
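One possible shape for that check — an assumption on my part, not necessarily what quarry:query/80167 actually does — is to treat a block-log target as an IP block if the title parses as IPv4, IPv6, or a CIDR range of either:

```sql
-- Classify block-log targets as IP/range blocks vs. account blocks.
-- The regexes are rough (they accept some invalid addresses) but usernames
-- can't contain '.'-separated octets or ':' groups, so they discriminate.
SELECT log_title,
       (log_title RLIKE '^([0-9]{1,3}\\.){3}[0-9]{1,3}(/[0-9]{1,2})?$'  -- IPv4
        OR log_title RLIKE
           '^[0-9A-Fa-f]{1,4}(:[0-9A-Fa-f]{0,4}){2,7}(/[0-9]{1,3})?$') -- IPv6
       AS is_ip_block
FROM logging_logindex
WHERE log_type = 'block'
  AND log_action = 'block'
LIMIT 100;
```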
That's... a brilliant workaround for a problem that shouldn't exist.
MPopov (WMF), is this a problem for your team as well? I always assume that changing the structure of a log is scary, but this seems like a really odd choice. Maybe it'd be worth it.
(name, not id, but I know what you meant). For 15k it's not going to make a difference. For the not-quite-47 million registered users we have, it did. This is mainly a problem in datamining. In production, it's perfectly reasonable to store the affected user or ip of a historical block into the same field normally used for a page title, and for that matter its details into a string of serialized php data like a:3:{s:11:"5::duration";s:7:"2 weeks";s:8:"6::flags";s:17:"anononly,nocreate";s:8:"sitewide";b:1;} - this isn't information that the wiki actually uses for anything except to display it when someone looks at
Special:Log. For active blocks, all of this data is stored in a separate, easily- and quickly-queryable table. —
Cryptic20:05, 3 February 2024 (UTC)
So if I want to find out how many of my ~15K student-newbies got blocked at any point (including expired blocks), will I need to put all 15K names into the query?
WhatamIdoing (
talk)
06:43, 6 February 2024 (UTC)
Well, you could. Or you could just shoehorn them into the db. Paste them all into a user subpage and link them, and they'll show up in the pagelinks table. —
Cryptic07:26, 6 February 2024 (UTC)
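A sketch of that subpage trick (the subpage title below is a placeholder, not a real page): once the ~15k usernames are linked from one page, pagelinks gives an indexed list of them, which can then be checked against the block log.

```sql
-- Pull the linked usernames back out of pagelinks and keep only those
-- with at least one entry in the block log.
SELECT pl.pl_title AS user_name
FROM pagelinks pl
JOIN page ON page_id = pl.pl_from
         AND page_namespace = 2
         AND page_title = 'Example/Student_accounts'  -- hypothetical subpage
WHERE pl.pl_namespace = 2                             -- links to User: pages
  AND EXISTS (
        SELECT log_id
        FROM logging_logindex
        WHERE log_type = 'block'
          AND log_namespace = 2
          AND log_title = pl.pl_title
      );
```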
Cryptic, can you update a previous query (from a few months ago, probably when we were talking about village pump participation) to tell me how many registered editors made an edit during 2023? I want to compare that against the 82,865 total blocks (although I suppose some of those were blocked before their first edit, which is a bit of an apples-and-oranges comparison).
WhatamIdoing (
talk)
06:41, 6 February 2024 (UTC)
I don't see anything adaptable in
Wikipedia:Request a query/Archive 4#Village pump participation. We were examining users who'd made edits to specific sets of pages, then once we had those users, examining them to see how many edits they'd ever made. That last number's handily stored as a field in the user table, I believe originally to support autoconfirmed. If you want to know how many users - without any other way to narrow down which users to examine - made any edits in a specific time period, there isn't going to be a significantly faster way than to look at every revision in that time period. There were 60 million in 2023. I'll start some queries running and check them in the morning. No promises. —
Cryptic07:26, 6 February 2024 (UTC)
Would someone be able to search all of the articles on {{Green Bay Packers Hall of Fame}} to provide a list of those articles that do not have the string https://www.packers.com/history/hof/ somewhere on their page? If you take a look at
Green Bay Packers Hall of Fame, Packers historian Cliff Christl wrote an article on almost every Hall of Famer for the Packers (referenced in the table). It is a great source and I would like to add the reference to each players' write-up to their Wikipedia article without having to search through every single article (160+ articles, with many already having their write-up referenced on their page). Thank you! « Gonzo fan2007(talk) @ 15:44, 16 February 2024 (UTC)
@
Gonzo fan2007 This got a little messy because of needing to account for archived pages as well, but this should be correct?
quarry:query/80453. It looks like 114 of 167 in the category do not have links matching the string you indicated (note I also included packers.com/history/hall-of-famers/ as this seemed to appear as well). Let me know if you come across issues and I'll try to fix them. FTP!
Uhai (
talk)
17:58, 16 February 2024 (UTC)
I would like to find all new AfC reviewers who recently hit some review threshold (see
this discussion). I'd operationalize this as something like "All users with their first AfC review under 6 months ago and their 25th review within the last week", but if something with a similar effect is technically easier I'm not attached to the specific formulation. @
Novem Linguae directed me here to see if someone could help write a query; they say the typical way to grab AfC reviews, which isn't a formally logged action, is to look at edit summaries, e.g.
here.
Rusalkii (
talk)
00:20, 24 February 2024 (UTC)
To start with, this can't possibly see afc reviews where the underlying page has been deleted (including where the redirect left from draft or, more likely userspace, has been deleted) - edit summaries of deleted revisions aren't exposed to the toolforge replicas. And searching revision, which is necessary if you're looking more than 30 days back, without some way to drastically narrow the search space isn't feasible. "Only users who've had at least one matching edit summary in the last week" may or may not be sufficient, we'll see. These actions should really, really be tagged; go ask at
WP:EFR, if leaving such edit summaries without using the actual tool is to be counted, or change the tool if not. —
Cryptic02:20, 24 February 2024 (UTC)
The
AFC helper script that accepts/declines/rejects drafts makes a variety of edits, including to: drafts, the log page at
Wikipedia:Articles for creation/recent, sometimes a userspace log depending on the user's preferences (example:
User:Novem Linguae/AfC log), and user talk pages. Which edits would be most helpful to tag? Or do you recommend tagging all of them? Also, tags are different from edit filters, right? I could just make a tag at
Special:Tags and update AFCH's code, and not need to use an edit filter at all, correct? –
Novem Linguae (
talk)
02:44, 24 February 2024 (UTC)
I'd tag accepts and declines but not comments, log pages, or user talk pages, probably with a separate tag for accepts and declines/rejects. Yes, they're very different; I mention edit filters because they can add tags even if someone doesn't use the tool. —
Cryptic02:59, 24 February 2024 (UTC)
To give you some idea of the scope of the problem,
quarry:query/80638. There's 118 users with at least one vaguely-likely-to-be-an-AFC-review edit in the last week; among them, they have 10.8 million total edits. Each of those edits has to be looked at to see if the edit summary matches. (Well, not quite all - I can exclude edits from before 11 December 2009,
the first revision of what became the AFCH script. But most of them.) Even limiting it to one of the sets of users with lower edit count is going to take a long time. Are the writes to
Wikipedia:Articles for creation/recent optional, or do they always happen? Is there always exactly one edit there per afc review? If so, when did that start? Analyzing its history would be much, much faster even than if there'd been tags from the beginning, and would be able to deal with now-deleted reviews too. —
Cryptic04:23, 24 February 2024 (UTC)
And, as tends to happen, my query on toolforge finished just after I wrote that. The public version at
quarry:query/80639 should finish in about an hour unless the results happened to cache favorably (I don't think that's likely for this sort of query), but as of a few minutes ago the only such user was Broc. —
Cryptic04:41, 24 February 2024 (UTC)
The /recent page is always written to for accepts, but never for declines and rejects. Filtering by user talk namespace may be helpful for counting afc actions, because everyone always gets a notification, and user talks are never deleted by g13. –
Novem Linguae (
talk)
04:43, 24 February 2024 (UTC)
Yeah, just saw that from analyzing Broc's reviews. Looking at user talk notifications would help some with now-deleted reviews, but still be unusably slow with nothing to narrow them except edit summary. And many, many, many IP talk pages have been badly speedied. —
Cryptic04:56, 24 February 2024 (UTC)
I think it's a bit rare to have more than one tag per piece of software. At this point I am leaning towards having AFCH tag all its edits as AFCH. That could hopefully then be used to speed up quarry queries.
Feature request. –
Novem Linguae (
talk)
11:12, 24 February 2024 (UTC)
Users who meet certain thresholds
Hi all, over at
WP:RFA2024/I someone
proposed we pull administrators from those who meet the following criteria:
At least 10,000 total edits, including at least 5,000 in main space
At least 1,000 edits in the past year, including at least 500 in main space
Account registered at least three years ago
No sanctions within the past five years
At least one featured article or three good articles
Have never lost adminship under a cloud
Is it possible to generate a list of editors who meet at least 1-4? #6 is rare enough that I can check manually. #5 is a bit tricky. I believe each editor's number of GAs is listed at
User:GA_bot/Stats. I believe all users who have brought a featured article through FAC are at
Wikipedia:List of Wikipedians by featured article nominations. I think generating this list would help inform the discussion. My guess is that it's a small list. Thanks a million to anyone up to the challenge!
Ajpolino (
talk)
02:42, 27 February 2024 (UTC)
The short answer is "no".
User:GA bot/Stats is users by the number of reviews performed, not good articles written; the last entry on the list,
ヒストリ案, is illustrative. Similarly, the FA page isn't the users who primarily wrote the articles, it's the ones who took them through the FAC bureaucracy (though there's a high amount of correspondence). And even if they were both good data, there isn't a convenient way to extract the number of articles from the GA page and the FA page includes articles that are no longer featured; those, at least, could be addressed by manually creating similar pages (
WP:List of users with three or more Good Articles and
WP:List of users with at least one still-Featured Article). #4 isn't doable even in principle. Sanctions, in general, aren't logged (in the
Special:Log sense) unless they happen to be blocks. I don't think there's a good list anywhere for #6. We have a couple applicable subpages of
Wikipedia:Former administrators/reason, but lots are mixed into
/resigned. #2 and the second half of #1 are absurdly slow. There's no remotely efficient way to count edits by time or namespace other than to look at all of them. That's barely doable when we're already looking at individual users, or a small list of users with relatively few edits - say, no more than a million total edits, in either case - but when I poked at this briefly when it was first proposed, I ran the numbers just for 10k total edits and 3 years' registration; there's 10,318 users with 441.2 million total edits. #5 will cut that down, but not nearly by enough. —
Cryptic04:33, 27 February 2024 (UTC)
Understood, thank you for looking into it. Would it be possible to get the list of just the first halves of #s 1 and 2, plus #3. I.e. Users whose accounts are at least 3 years old, with at least 10,000 total edits including 1,000 in the past year? I'm guessing even a filter of "have edited in the past year" would cut down your 10,318 users quite a bit, but I'm not sure it's possible or worthwhile to attempt. Thanks again
Cryptic.
Ajpolino (
talk)
13:56, 27 February 2024 (UTC)
That shows, as of this writing, users who've registered less than three years ago. You'll also want to exclude users in groups sysop or bot. —
Cryptic15:18, 27 February 2024 (UTC)
"1000 in the last year" still means having to look at every edit each of those editors made in the last year, and there's too many. "Any edits in the last year" does help - it cuts the number down from 10319 (congrats, whoever you are) to 6176 (
quarry:query/80711) - but the other counts still aren't going to be feasible until we're down to a couple dozen. —
Cryptic15:18, 27 February 2024 (UTC)
Top 500 most redlinked articles?
I would love to see the most red-linked articles. There's a wikispace table for this somewhere, but it has serious problems.
Zanahary (
talk)
07:41, 28 February 2024 (UTC)
Related:
Special:WantedPages - all namespaces, so not very useful.
phab:T37758, which would add filtering by namespace to this page, is open and awaiting a patch.
Wikipedia:Most-wanted articles - looks like an attempt to list this for mainspace. but the list is outdated by over a year and also appears to be incomplete.
Thanks for this! Yeah, Most-wanted is what I was referring to. It's confusing and clearly just not the most-redlinked articles.
Zanahary (
talk)
07:56, 28 February 2024 (UTC)
If only this would exclude transclusions via template! I just want a meaningful result that has red-linked articles by the number of times they were linked to by an editor.
Zanahary (
talk)
08:06, 28 February 2024 (UTC)
Bad decisions made 22 years ago.
phab:T14396 from 2007 is a reasonable place to start looking; it contains a couple indices of the many, many discussions (even then!) begging for this to be made possible. —
Cryptic08:19, 28 February 2024 (UTC)
I believe the link documents that pretty thoroughly? I mean, the workarounds to get a single page's incoming nontranscluded links clearly can't be scaled to multiple pages, let alone all articles. —
Cryptic09:39, 28 February 2024 (UTC)
Leaving out the transcluded-link question, there's no reason someone can't just run Certes'
query, as linked from
WP:MWA, again. The hard part's already done, it just takes someone willing to deal with the tedium of running it for each partition of titles and collating the results - AND pl_title < "B"; AND pl_title >= "B" AND pl_title < "C"; ... AND pl_title >= "Y" AND pl_title < "Z"; AND pl_title >= "Z". —
Cryptic08:21, 28 February 2024 (UTC)
If it's going to be used regularly then we can copy the Quarry query to Wikipedia as a {{database report}}. That has the advantage that anyone can edit it on Wikipedia and re-run it without logging in to Quarry. (Only I can edit the Quarry query, though anyone can copy and paste it.) pl_title LIKE "B%" seems as fast as the >= AND < syntax, and may be more readable. Beware that many titles which appear to be commonly redlinked are actually linked once from a widely used template which contains dozens of similar redlinks, e.g. {{Areas of Chennai}}.
Certes (
talk)
09:40, 28 February 2024 (UTC)
LIKE 'B%' is never going to work for non-alphabetically-starting titles. Granted they're not likely to be widely-linked, but it's not impossible. —
Cryptic09:51, 28 February 2024 (UTC)
True. There are plenty of initials before A (notably 1 and 2 for "2024 in Tiddlywinks" etc.) and after Z. We need to use <= and > for the first and last partitions. We may also be able to get away with something like >="W" for adjacent uncommon initials.
Certes (
talk)
16:07, 28 February 2024 (UTC)
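For concreteness, one partition of such a most-redlinked query might look like the sketch below. This is an illustrative shape, not Certes' exact query, and it assumes the older pagelinks schema (pl_namespace/pl_title columns, still present on the replicas as of this discussion):

```sql
-- Sketch: redlink counts for mainspace titles in the "B" partition.
-- Run once per initial and collate; use open-ended ranges for the
-- first and last partitions to catch digits and post-Z titles.
SELECT pl_title, COUNT(*) AS incoming_links
FROM pagelinks
LEFT JOIN page ON page_namespace = pl_namespace AND page_title = pl_title
WHERE pl_namespace = 0
  AND pl_title >= 'B' AND pl_title < 'C'
  AND page_id IS NULL                -- no such page: it's a redlink
GROUP BY pl_title
ORDER BY incoming_links DESC
LIMIT 500;
```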
List of old issues (per year) for articles in a WikiProject
Uhai Wow, it works perfectly - thanks! I only needed to make a couple minor adjustments for my use. Check out the results
here. I'm excited I won't need to manually keep up this data. Now if I could wrap my head around the SQL (I never had to use WITH RECURSIVE before). :) Best,
Stefen Towers among the rest!Gab •
Gruntwerk08:28, 8 March 2024 (UTC)
That bit picks up the FA and GA categories which are not biographies, like
Talk:Wilner Burke which is only in
GA-Class Wisconsin articles. Without it, the rating comes out blank. The {{!}} produces a "|" for regexp alternation; the template prevents it from delimiting the |sql= parameter. If you're copying the SQL into Quarry or some other SQL client that doesn't parse wikitext, you'll need to change {{!}} to | manually.
Certes (
talk)
10:34, 22 March 2024 (UTC)
Thanks. I already knew about {{!}} from all the template work I've done and alternation from all the RegEx I've written. It was the rest that confused me, esp. why there would be interest in just FA/GA outside of bios. At any rate, that part wasn't necessary for my purposes for listing articles that are definitely within one wikiproject.
Stefen Towers among the rest!Gab •
Gruntwerk16:39, 22 March 2024 (UTC)
Fetch a subset of log comments from the last year
quarry:query/81887. The part of my query that limits things to the last year (log_id > subquery) is running too slow and is causing it to timeout. Any suggestions to speed it up? The query works fine when I hard-code a log_id like I did in
quarry:query/81844. But I'd prefer the query to be more dynamic than just hard-coding a log_id. Thanks. –
Novem Linguae (
talk)
13:55, 9 April 2024 (UTC)
quarry:query/81896. Using the indexed views logging_logindex and actor_logging sped things up a lot. As a side note, since text fields in the replicas are either VARBINARY or BLOB type, the LIKE clause is in fact case-sensitive. If you first convert the fields to UTF-8 then LIKE works as expected. Also, unless you want an order other than oldest-first, there's no reason to sort by log_id ascending, since that's already the clustered index of the table.
Uhai (
talk)
03:25, 10 April 2024 (UTC)
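A minimal sketch of the pattern described above, using the indexed replica views and a UTF-8 conversion for case-insensitive matching (the search string is a placeholder, not from the original query):

```sql
-- Sketch: block-log comments from the last year, via the
-- logging_logindex and actor_logging views so the indexes stay usable.
SELECT log_timestamp, actor_name, comment_text
FROM logging_logindex
JOIN actor_logging ON actor_id = log_actor
JOIN comment_logging ON comment_id = log_comment_id
WHERE log_type = 'block'
  AND log_timestamp >= DATE_FORMAT(NOW() - INTERVAL 1 YEAR, '%Y%m%d%H%i%s')
  -- VARBINARY/BLOB fields compare case-sensitively; converting to UTF-8
  -- makes LIKE behave case-insensitively:
  AND CONVERT(comment_text USING utf8mb4) LIKE '%sockpuppet%';
```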
Bah. I always forget to add those pesky alternative views. Thank you very much for catching that, and for the other tips. –
Novem Linguae (
talk)
06:10, 10 April 2024 (UTC)
The specialized views helped a little, but what made the real difference here was adding "log_type = 'block'". log_action is a subindex of that; trying to filter by that without log_type is like searching a paper dictionary for words where the second letter is "r". Compare EXPLAIN output
without and
with the change, and see
Database index#Column order. —
Cryptic16:07, 10 April 2024 (UTC)
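In other words (a sketch, assuming the logging table's compound indexes lead with log_type, per the dictionary analogy above):

```sql
-- Filtering on log_action alone can't seek any index, because log_action
-- appears only after log_type in the compound indexes. Pairing the two
-- lets the optimizer seek directly:
SELECT COUNT(*)
FROM logging_logindex
WHERE log_type = 'block'      -- leading index column: seekable
  AND log_action = 'block';   -- second column: usable only with log_type
```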
Women in Red
I'm not exactly requesting a query, but help with one we already have:
Wikipedia:WikiProject_Women_in_Red/Redlinks/C2.
This version ran in 37 seconds. Some time between 20 and 25 February, it stopped working, instead throwing SQL Error: ER_DATA_TOO_LONG: Data too long for column 'subcat' at row 2. This is because it initially populates a table with one row, a fairly short category name, then attempts to add longer names. This used to work but something in the database or engine must have changed, as the same SQL now fails. I've fixed it by seeding the table with a long dummy value which doesn't affect the results, but
that version takes nearly two hours to run. Does anyone know what changed? Is part of the forthcoming categorylinks "upgrade" causing the failure and wrecking performance? (A simple search for a link to details of that change timed out too – perhaps there is a more general temporary performance problem but VPT has nothing, and performance was just as bad last night.)
Certes (
talk)
15:24, 24 March 2024 (UTC)
I poked at this a little when you first asked it, and some more today, and I wasn't able to come up with a version that avoided a full table scan on pagelinks - taken in isolation, the CTE completes instantly, and I can do something like WITH names AS (/* the long cte */) SELECT COUNT(*) FROM names JOIN pagelinks ON pl_namespace = 0 AND pl_title = name GROUP BY name, but every variant of "pl_title starts with name" I could come up with - name = SUBSTRING_INDEX like you used, pl_title LIKE CONCAT(name, '%'), LEFT(pl_title, LENGTH(name)) = name, and so on - was unindexed. Which is why your query taken as a whole is looking through all of page first, when I suspect it was able to make use of the Cn..D range limit before. Irritatingly, putting the CTE results directly in the query instead like
quarry:query/81913 does work, but I don't know a workaround to force use of the index when we don't have access to the real pagelinks table, just a view. (And no, I don't know what changed.) Something like SELECT CAST('Feminine_given_names' AS VARCHAR(256)), 0 would be cleaner than your long 'This category does not exist but...' dummy value, but obviously that's not the real problem here. What, precisely, are you trying to do? Maybe we can find another way to do it? —
Cryptic17:50, 10 April 2024 (UTC)
Is WHERE pl_from_namespace IN (0, 10) ... HAVING MAX(pl_from_namespace) = 0 meant only to exclude titles linked from any template? The query completes quickly if the first half of that is changed to WHERE pl_from_namespace = 0. Explain output for query 81446 as written (sql-optimizer seems to choke on queries with ctes):
+------+-----------------+---------------+--------+------------------------------------------------------------------------------+-----------------+---------+---------------------------------+----------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-----------------+---------------+--------+------------------------------------------------------------------------------+-----------------+---------+---------------------------------+----------+-----------------------------------------------------------+
| 1 | PRIMARY | page | index | PRIMARY,page_name_title | page_name_title | 261 | NULL | 57680411 | Using where; Using index; Using temporary; Using filesort |
| 1 | PRIMARY | pagelinks | ref | PRIMARY,pl_namespace,pl_backlinks_namespace,pl_backlinks_namespace_target_id | PRIMARY | 8 | enwiki.page.page_id,const | 4 | Using where |
| 1 | PRIMARY | <derived5> | ref | key0 | key0 | 258 | func | 1 | Using where |
| 1 | PRIMARY | page | eq_ref | page_name_title | page_name_title | 261 | const,enwiki.pagelinks.pl_title | 1 | Using index |
| 5 | DERIVED | <derived2> | ALL | NULL | NULL | NULL | NULL | 2 | Using where; Using temporary |
| 5 | DERIVED | categorylinks | ref | PRIMARY,cl_timestamp,cl_sortkey | cl_sortkey | 258 | deepcat.subcat,const | 19 | Using where; Using index |
| 5 | DERIVED | page | eq_ref | PRIMARY,page_name_title | PRIMARY | 4 | enwiki.categorylinks.cl_from | 1 | Using where |
| 2 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
| 3 | UNION | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
| 4 | RECURSIVE UNION | <derived2> | ALL | NULL | NULL | NULL | NULL | 2 | Using where |
| 4 | RECURSIVE UNION | categorylinks | ref | PRIMARY,cl_timestamp,cl_sortkey | cl_sortkey | 258 | deepcat.subcat,const | 19 | Using where; Using index |
| 4 | RECURSIVE UNION | page | eq_ref | PRIMARY,page_name_title | PRIMARY | 4 | enwiki.categorylinks.cl_from | 1 | Using where |
| NULL | UNION RESULT | <union2,3,4> | ALL | NULL | NULL | NULL | NULL | NULL | |
+------+-----------------+---------------+--------+------------------------------------------------------------------------------+-----------------+---------+---------------------------------+----------+-----------------------------------------------------------+
and the same for WHERE pl_from_namespace = 0:
+------+-----------------+---------------+--------+------------------------------------------------------------------------------+------------------------+---------+---------------------------------+----------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-----------------+---------------+--------+------------------------------------------------------------------------------+------------------------+---------+---------------------------------+----------+-----------------------------------------------------------+
| 1 | PRIMARY | pagelinks | range | PRIMARY,pl_namespace,pl_backlinks_namespace,pl_backlinks_namespace_target_id | pl_backlinks_namespace | 265 | NULL | 50732398 | Using where; Using index; Using temporary; Using filesort |
| 1 | PRIMARY | <derived5> | ref | key0 | key0 | 258 | func | 1 | Using where |
| 1 | PRIMARY | page | eq_ref | page_name_title | page_name_title | 261 | const,enwiki.pagelinks.pl_title | 1 | Using index |
| 1 | PRIMARY | page | eq_ref | PRIMARY,page_name_title | PRIMARY | 4 | enwiki.pagelinks.pl_from | 1 | Using where |
| 5 | DERIVED | <derived2> | ALL | NULL | NULL | NULL | NULL | 2 | Using where; Using temporary |
| 5 | DERIVED | categorylinks | ref | PRIMARY,cl_timestamp,cl_sortkey | cl_sortkey | 258 | deepcat.subcat,const | 19 | Using where; Using index |
| 5 | DERIVED | page | eq_ref | PRIMARY,page_name_title | PRIMARY | 4 | enwiki.categorylinks.cl_from | 1 | Using where |
| 2 | DERIVED | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
| 3 | UNION | NULL | NULL | NULL | NULL | NULL | NULL | NULL | No tables used |
| 4 | RECURSIVE UNION | <derived2> | ALL | NULL | NULL | NULL | NULL | 2 | Using where |
| 4 | RECURSIVE UNION | categorylinks | ref | PRIMARY,cl_timestamp,cl_sortkey | cl_sortkey | 258 | deepcat.subcat,const | 19 | Using where; Using index |
| 4 | RECURSIVE UNION | page | eq_ref | PRIMARY,page_name_title | PRIMARY | 4 | enwiki.categorylinks.cl_from | 1 | Using where |
| NULL | UNION RESULT | <union2,3,4> | ALL | NULL | NULL | NULL | NULL | NULL | |
+------+-----------------+---------------+--------+------------------------------------------------------------------------------+------------------------+---------+---------------------------------+----------+-----------------------------------------------------------+
The first doesn't look like it should examine many more rows, but it's definitely taking forever, and the table order makes less sense. —
Cryptic20:03, 10 April 2024 (UTC)
Fixed@
Cryptic: Thank you so much. So it needs a STRAIGHT_JOIN rather than a JOIN, to force the tables to be processed in the suggested order. I've put that into C2 and it works, so I'll update the other queries. I've also neatened the table creation as you suggest. I had something similar originally but replaced it with a simpler kludge, because the CAST was the only code I'd changed when the timing went off, and I suspected it was causing the problem, perhaps by making columns to be joined have incompatible types.
Certes (
talk)
20:54, 10 April 2024 (UTC)
...and yes, HAVING is to exclude people in templates, who might be linked from hundreds of articles just because their name appears in one navbox.
Certes (
talk)
20:56, 10 April 2024 (UTC)
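For anyone following along, the difference is just the join keyword; a simplified illustration (not the actual C2 query):

```sql
-- With plain JOIN the optimizer picks the table order from its cost
-- estimates; STRAIGHT_JOIN forces tables to be read left to right as
-- written, which is the fix described above.
SELECT page_title, COUNT(*) AS incoming_links
FROM page
STRAIGHT_JOIN pagelinks
    ON pl_namespace = page_namespace AND pl_title = page_title
WHERE page_namespace = 0
GROUP BY page_title
LIMIT 10;
```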
I'd bet money that what changed is just that pagelinks is incrementally bigger. The optimizer's wildly overestimating the cost of scanning it, so that looking at page AS Pf became more attractive. Removing namespace 10 helped not because the database horrifically couldn't use two disjoint ranges anymore, but because it lowered the estimated cost just enough to scan it first. What confused me last month is that I thought it was looking at page AS Pt instead, and that it was primarily the cte that was meant to cut the number of rows down. —
Cryptic21:06, 10 April 2024 (UTC)
That sounds very plausible. I remember from my SQL-writing days decades ago that query plans would suddenly tip over from the desired order to something unhelpful when data sizes changed. I've applied the fix to the similar reports with other initials, rerun the broken ones and everything's now working. Thanks again.
Certes (
talk)
21:20, 10 April 2024 (UTC)
I've also updated the first chapter of
Men in Red, though currently no one seems interested in its output. There's a biologist there with over 100 links (from actual articles, not navboxes) and articles in
French,
German and
Spanish.
Certes (
talk)
15:21, 11 April 2024 (UTC)
On Toolforge, I got results for
quarry:query/82121, a straightforward update from 81918 above, in 4 seconds, and the query from your diff in 4 minutes 37 seconds. They had identical query plans, modulo the extra union from 'This category does not exist...', except that the first estimated 1,743,728 rows from linktarget (index lt_namespace_title) and the second 3,057,442. If SDZeroBot's timing out, I'd try with a narrower title range. —
Cryptic21:50, 18 April 2024 (UTC)
Thanks for investigating. We may be able to complete this task before the columns we need are dropped. If not then I'll take your advice.
Certes (
talk)
15:12, 19 April 2024 (UTC)
Redirects connected to a Wikidata item that target a disambiguation page
I attempted this in
quarry:query/82243, but I may well have missed things that would make it more efficient. I’m wondering if the time it takes to run is purely a symptom of having to look through two very large categories, or if anyone’s aware of things that might improve it. All the best :) —a smart kitten[
meow18:14, 24 April 2024 (UTC)
Check for company articles that use both the Parent and Owner fields.
I'm wondering if it would be possible to check for articles that use both the Parent and Owner field in infobox company. The reason being, I have noticed that many articles incorrectly use the infobox company template to show higher-level ownership. This is not allowed per the infobox documentation. I have fixed this on most pages. However, there are many that I cannot find. WiinterU22:45, 27 April 2024 (UTC)
@
WiinterU: Quarry can't do that, because it doesn't have access to the wikitext.
This search should catch most of the offenders but will contain a lot of false positives, such as articles with parent= and owner= in different infoboxes.
Certes (
talk)
08:35, 28 April 2024 (UTC)
Agreed. According to
my sandbox, ref tags aren't placed into any categories, so we can't search the categorylinks table either. The others that OP mentioned can be queried via the templatelinks table. Or, for new pages, you can use
Special:NewPagesFeed, and tick the filter for "have no citations", which should detect most of these but maybe not all (I forget which ones).
WP:PETSCAN could also be useful since I think it lets you search by templates and/or categories. OP, what direction would you like to go in with this? –
Novem Linguae (
talk)
20:30, 4 May 2024 (UTC)
We could generate a list of pages without any external links at all, or no external links except to specific sites like imdb. That's functionally the same thing - there's no real difference between a formally unreferenced page that has its subject's official site properly placed in ==External links==, and one that plugs it into ref tags instead. —
Cryptic22:10, 4 May 2024 (UTC)
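A sketch of that idea, partitioned by title range as discussed elsewhere on this page (the range bounds are placeholders):

```sql
-- Mainspace, non-redirect pages with no rows in externallinks at all,
-- i.e. no external links anywhere on the page.
SELECT page_title
FROM page
LEFT JOIN externallinks ON el_from = page_id
WHERE page_namespace = 0
  AND page_is_redirect = 0
  AND page_title >= 'A' AND page_title < 'B'
  AND el_from IS NULL;
```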
You're looking at quite a lot of pages. For example, there are about 30,000 with titles beginning with A.
[2]. Many of them have a References section with plausible looking citations but no inline refs.
Certes (
talk)
22:56, 4 May 2024 (UTC)
The linked discussion shows the original user was specifically looking for unreferenced pages that aren't already tagged. (I can't imagine why, unless the intent is merely to tag them without actually looking for sources, but whatever.) —
Cryptic07:13, 5 May 2024 (UTC)
Help with expanding the self-transcluded-templates report
Hey. I was wondering if this is even possible and if someone here might be able to help with this. The query in
User:Jonesey95/self-transcluded-templates collects templates that either have zero transclusions or are only transcluded on the template itself. I'd like to extend this to also the sub-pages of the template.
The new logic would be:
Get all templates with number of transclusions less than 6 (5 valid transclusions are: main, talk, /sandbox, /doc, /testcases - of the same template).
Keep in the report all templates that have all of their transclusions only on one of the above pages.
What you ask for in your first paragraph and in your bulleted list aren't the same things, or even terribly similar. Also, while we could literally do what you're asking for in your bulleted list - first generate a list of templates with 0-5 transclusions, and then cull that - it would be complex, quite likely very slow, and possibly wouldn't do quite what you want. Generally it's a better idea to ask for what you actually want, rather than how to get it, and that's especially true with SQL: since it's primarily a descriptive language, describing the result is what the query-writer does too. (At least, until something goes wrong.) What I'm guessing you're really after is to exclude transclusions on
the template's own talk page, and
either
the template's own "/sandbox", "/doc", or "/testcases" subpages, or
"What you ask for in your first paragraph and in your bulleted list aren't the same things" - I know... The first paragraph is what we currently have at
User:Jonesey95/self-transcluded-templates. What I asked in the bulleted list is what I hoped we could modify it to.
Regarding your second question, B2. Exclude all of a template's own subpages (and talk page). So the final result would be a database report with templates that have "zero" transclusions (but might have actual transclusions on their own pages).
Gonnym (
talk)
17:09, 29 May 2024 (UTC)
If I've understood correctly: as well as the existing condition tl_from <> page_id, tl_from also has to differ from the page ids of all subpages and any talk page.
Certes (
talk)
17:23, 29 May 2024 (UTC)
No. The transclusion check is meant to reduce the number of valid transclusions to zero. So for example, if a template is transcluded only on itself, then it should be on the report.
Template:Article length bar/L0 was removed, but it shouldn't have been, as it's unused other than on itself.
Gonnym (
talk)
17:53, 29 May 2024 (UTC)
Yeah, just realized that I was only removing items from the list that already showed no transclusions. —
Cryptic17:55, 29 May 2024 (UTC)
Looks great! I've browsed it and everything looks good. I'll have a deeper dive into it and if I find something I'll let you know, but so far works as requested. Thank you!
Gonnym (
talk)
20:14, 29 May 2024 (UTC)
How do the patrol/pagetriage-curation logs really work?
Does anyone really know, definitively, what the distinction between (log_type = 'patrol' AND log_action = 'patrol') and (log_type = 'pagetriage-curation' AND log_action IN ('reviewed', 'reviewed-article', 'reviewed-redirect')) is? In particular, why do you variously get one, the other, or both even for patrols by the same person within minutes of each other (example:
both,
pagetriage only,
patrol only); exactly which on-wiki actions result in those cases; and is there a good way to distinguish genuine multiple reviews of the same title without doublecounting the various log_type/log_action combinations (perhaps group by log_page)?
Wikipedia:New pages patrol#Patrol versus review is my attempt to document this. It is confusing. I think the native patrol system has some unintuitive behaviors (maybe autoreviewing in certain cases, maybe being revision-based instead of page-based, not exactly sure, but I sense some differences), that I haven't quite wrapped my head around yet and that make it not correspond 1:1 to reviewing. Also, the original PageTriage authors tried to keep the patrol and review logs in sync but missed some cases such as page moves, so there are some bugs.
phab:T346215,
phab:T337356. Finally, PageTriage reviewing only occurs in the main namespace (and formerly the user namespace), whereas native patrolling with the "[Mark this page as patrolled]" link can occur in any namespace.
When evaluating an NPP's number of reviews, one should use the pagetriage-curation log exclusively because it corresponds to the NPP clicking the reviewed button in the Page Curation toolbar. The patrol log is less accurate. When the Page Curation toolbar is open, as it is for most NPPs unless they close it, the "[Mark this page as patrolled]" link is hidden.
Confusingly, XTools only displays the patrol count, not the reviewed count.
quarry:query/70425 is an old query I wrote to get a count of a person's reviews only. I think I'll go file an XTools ticket to use "reviews" instead of "patrols" for enwiki.
phab:T366397
One other thing. 'reviewed', 'reviewed-article', 'reviewed-redirect'. It used to be only reviewed, then we got rid of reviewed and split it into reviewed-article and reviewed-redirect for easier querying. This split happened about a year ago.
phab:T349048 –
Novem Linguae (
talk)
05:08, 1 June 2024 (UTC)
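Based on the advice above, a sketch of counting a patroller's reviews from the pagetriage-curation log rather than the patrol log (the username is a placeholder):

```sql
SELECT COUNT(*) AS reviews
FROM logging_userindex
JOIN actor_logging ON actor_id = log_actor
WHERE actor_name = 'ExampleReviewer'
  AND log_type = 'pagetriage-curation'
  -- 'reviewed' covers entries from before the split described above;
  -- the two newer names cover entries after it:
  AND log_action IN ('reviewed', 'reviewed-article', 'reviewed-redirect');
```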
Yes, CirrusSearch has full-text indices on page titles, while the main db only has normal (prefix) ones. Using the page_name_title index speeds this sort of query up considerably, since titles not matching '%.js' can be filtered out there (even though every title still needs to be looked at, the whole row won't need to be fetched). There's no way to force that, though, since we only have access to views. Sometimes you can fool the optimizer into the indexed search if you say which namespaces you want instead of the ones you don't, even if the list is long, something like WHERE page_namespace IN (0, 1, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 100, 101, 118, 119, 710, 711, 828, 829), but that doesn't work here. So you're going to have to partition it into several (well, lots of) queries that look small enough that the optimizer uses the index instead of a full table scan - WHERE page_namespace = 0 AND page_title < 'M', WHERE page_namespace = 0 AND page_title >= 'M', and so on for the other possible namespaces. —
Cryptic00:38, 22 June 2024 (UTC)
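One partition of that approach might look like the following sketch; repeat with page_title >= 'M' and for each other namespace, then collate:

```sql
-- Keep the range narrow enough that the optimizer chooses the
-- page_name_title index over a full table scan.
SELECT page_namespace, page_title
FROM page
WHERE page_namespace = 0
  AND page_title < 'M'
  AND page_title LIKE '%.js';
```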
An alternate approach: since you know that there won't be many titles ending in '.js' except in userspace and mediawiki:, find the titles yourself by grepping enwiki-20240601-all-titles.gz from a dump. Link to all of them from a page in your userspace, then you can use pagelinks to find them in a query and check page_content_model and page_restrictions. —
Cryptic04:19, 22 June 2024 (UTC)