This is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the
current talk page.
Would it be useful to have a page where you can test new regexes that will be loaded either with, or instead of, the main typo list, so you can debug live/reduce chances of causing problems to live lists?
I think testing should be done in Find&Replace. However, it would be FKING AWESOME if there was an "export to RETF" feature of Find&Replace once I'm done testing. --
mboverload@17:51, 2 August 2008 (UTC)
Just thinking, it wouldn't be difficult to have it copy the result to the clipboard as a typo-style rule, as an option. -- Reedy 20:44, 22 August 2008 (UTC)
I think that starting a search string with a zero-width look-ahead and then the desired search string, usually used to exclude certain proper names, is harder on performance than either avoiding the zero-width assertions or using a zero-width look-behind assertion after the desired search string. Putting it at the beginning doubles the effort on things like Tremelo: at each check point (in this case, between every letter), see if Tremelo is the next string; if not, see if tremelo/Tremelo/tremelos/Tremelos is the next string; if so, replace the middle with remolo. I replaced it with a (buggy, but now fixed) version with no zero-width assertions, but the look-behind version would have been: see if tremelo/Tremelo/tremelos/Tremelos is the next string; if so, and it doesn't end with an s, make sure it wasn't Tremelo; if it wasn't, replace the middle with remolo. So the extra check is only made once AWB has gotten a possible match, not on every spot.
There are a couple of other places where a similar change could be made, but I remember some possible problem with the look-behinds and some of the other tools that use this list, so I'd like to open it up for discussion first. --
JHunterJ (
talk)
11:47, 25 August 2008 (UTC)
Hmm, the zero-width look-aheads at the start of a rule are very useful, and performance of the typo list as a whole seems good to me, so I would be cautious about changing them. Could you provide an example of how the Tremelo rule would work with a look behind, as the current rule after your change and my fix looks confusing, though it works correctly? Thanks
Rjwilmsi12:07, 25 August 2008 (UTC)
Something like this:
<Typo word="Tremolo" find="\b(T|t)remelo(s\b|\b(?<!Tremelo))" replace="$1remolo$2" /><!-- don't match the place name Tremelo -->
So, match either T or t, then remelo, then either an s at the end of the word, or if we're already at the end of the word with "remelo", look back and make sure we didn't just see Tremelo. The only time we stop to look around is after we've already matched either Tremelo or tremelo. --
JHunterJ (
talk)
12:18, 25 August 2008 (UTC)
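As a rough illustration of the two rule shapes discussed above, here is a minimal sketch in Python rather than in AWB's .NET regex engine. The look-ahead pattern is an assumed reconstruction of the older rule, and the names `LOOKAHEAD_RULE`, `LOOKBEHIND_RULE` and `fix_tremolo` are hypothetical. It only checks that both forms make the same replacements; it says nothing about their relative speed in AWB.

```python
import re

# Assumed shape of the older rule: the look-ahead is tried at every word boundary.
LOOKAHEAD_RULE = re.compile(r"\b(?!Tremelo\b)(T|t)remelo(s?)\b")

# Look-behind form quoted above: the extra check only runs after "remelo" has matched.
LOOKBEHIND_RULE = re.compile(r"\b(T|t)remelo(s\b|\b(?<!Tremelo))")


def fix_tremolo(text, rule=LOOKBEHIND_RULE):
    """Apply one rule; Python writes $1 and $2 as \\1 and \\2."""
    return rule.sub(r"\1remolo\2", text)


if __name__ == "__main__":
    for sample in ["Tremelo", "tremelo", "Tremelos", "tremelos"]:
        print(sample, "->", fix_tremolo(sample), "/", fix_tremolo(sample, LOOKAHEAD_RULE))
```

Both forms leave the place name Tremelo alone and correct the other three spellings, so any difference between them would have to be measured with AWB's typo profiling.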
find="\b(T|t)remelo(s)?\b(?<!Tremelo)"
That seems to work just as well (though XML markup is wrong...). If it's true then the change is simply to move all (?!foo)bar to bar(?<!foo), and the question is whether this causes problems for other tools using the typo list?
Rjwilmsi12:44, 25 August 2008 (UTC)
Yours looks behind for "Tremelo" even in the case where we might have found an s at the end of the word. It should be possible to look behind only when the looked-for word could possibly appear, but either should perform better than starting with the look-ahead. --
JHunterJ (
talk)
12:59, 25 August 2008 (UTC)
One other option to use the current version with possibly less confusion:
?| is a "branch-reset" grouping, so each alternative therein should start numbering at 1. (Perl 5.10.0 and later). I can test it this evening (EST) if no one does so before then. --
JHunterJ (
talk)
16:52, 25 August 2008 (UTC)
It looks good for this example but I thought we wanted a general solution/standard. I think my suggestion is the most simple/general so far, but we need some performance data to see if it improves on the current entries using (?!blah).
Rjwilmsi17:14, 25 August 2008 (UTC)
Why does the solution need to be generic? I'd prefer the slight complication of inserting the look-behind only where it can match. --
JHunterJ (
talk)
17:18, 25 August 2008 (UTC)
(?| ... ) is an unrecognized grouping construct, according to the regexp tester in AWB. So the last bit is moot. --
JHunterJ (
talk)
00:09, 26 August 2008 (UTC)
The typo rule standard seems to be to explicitly match all endings of a word when the typo is in the start/middle of a word. It seems to me we could simplify such rules. Example:
Here it's clear that the error is a missing 'r' in the middle of the word and there's no ambiguity about which word this applies to, so the following would achieve the same result (edit summary would stay the same):
I think if we adopted such a convention for such situations (some if not a majority of the typo rules) by using [a-z]+ or [a-z]* we would benefit from: shorter rules, easier maintenance and easier addition of new rules. I would like feedback from others as to whether this seems like a good idea, particularly if there would likely be any performance change to the rules? Thanks
Rjwilmsi12:15, 25 August 2008 (UTC)
Sounds like an idea. Rjwilmsi, AWB has typo profiling... It might be worth me creating a temporary page with a few of these changed rules, and time them against the old version. —Reedy12:43, 25 August 2008 (UTC)
Hmm, I'd have said that within the measurement error there's no difference between the three. Perhaps we should try a longer one, where the advantage of simplification would be greater. Maybe:
I was thinking that myself to be honest. Is it a case of replacing the capture groups with \w+ and [a-zA-Z]+? (just thinking that it would be case sensitive as it is) —Reedy18:03, 25 August 2008 (UTC)
Yes, I would envisage using a \w+ or \w* as appropriate to make suitable rules shorter and more readable, make it easier to add new rules and potentially to catch endings that have been missed to date, while supporting all existing fixes. By using \w+ rather than just cutting off the regex, we will display the complete word changed in the edit summary.
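The rules originally quoted in this thread are not preserved in the archive, so the following is only a sketch of the idea using a made-up typo ("libary", with a missing "r"). The rule patterns and the names `EXPLICIT`, `GENERIC` and `apply_rule` are hypothetical, and Python stands in for AWB's regex engine.

```python
import re

# Hypothetical rule that spells out every ending it knows about.
EXPLICIT = re.compile(r"\b([Ll])ibar(y|ies|ian|ians)\b")

# Hypothetical shortened rule: \w+ accepts any ending and still captures
# the whole word, so the edit summary would show the complete word.
GENERIC = re.compile(r"\b([Ll])ibar(\w+)\b")


def apply_rule(rule, text):
    """Insert the missing 'r' and keep whatever ending was captured."""
    return rule.sub(r"\1ibrar\2", text)


if __name__ == "__main__":
    for sample in ["libary", "Libarians", "libarianship"]:
        print(sample, "->", apply_rule(EXPLICIT, sample), "/", apply_rule(GENERIC, sample))
```

In this sketch the generic form also fixes "libarianship", an ending the explicit list leaves out. The same openness is where any new false positives would come from.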
Avoiding false positives on scientific (Latin) names
One of the most common false positives I come across seems to be matching on lowercase words in scientific (Latin) names. An example would be ''Blah carolina'' (what Blah is doesn't matter here). These are matched by rules like \bcarolina\b as the regex \b includes a '. So the rule wants to be \b but not '. I'm struggling to find a neat way to do that beyond an explicit set of [\.,\s-] (since there are many entries that could do with this change). Anybody have any ideas?
Rjwilmsi11:29, 29 August 2008 (UTC)
Add a zero-width negative look-ahead to make sure the next character isn't an apostrophe: \bcarolina\b(?!'). But this will prevent a match on "I forgot to capitalize south carolina's initials." So make it look for two apostrophes: \bcarolina\b(?!''). --
JHunterJ (
talk)
12:37, 29 August 2008 (UTC)
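A small sketch of the suggestion above, written in Python for illustration; the function name `capitalize_carolina` and the sample sentences are assumptions, and the real typo rule may capture and replace differently.

```python
import re

# Skip the match when the word is immediately followed by two apostrophes,
# i.e. when it closes a wiki-italic span such as a scientific name.
RULE = re.compile(r"\bcarolina\b(?!'')")


def capitalize_carolina(text):
    return RULE.sub("Carolina", text)


if __name__ == "__main__":
    print(capitalize_carolina("''Blah carolina''"))
    print(capitalize_carolina("I forgot to capitalize south carolina's initials."))
```

The look-ahead only protects a lowercase word that ends an italic span; a lowercase name in the middle of an italic span would still be matched.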
I have been putting these in {{lang|lat|Tuxedo carolina}} templates. But it's not satisfactory. I'd rather have a scientific name mark-up.
Rich Farmbrough, 19:36 1 September 2008 (GMT).
A separate "scientific name" markup would be nice. We still give up catching "I forgot to capitalize bastard out of carolina." with the current solution. --
JHunterJ (
talk)
02:27, 2 September 2008 (UTC)
Typo bug
distictly goes to districtly, when context makes it clear it should be distinctly. Should this go here, or in the main AWB bugs section?
gnfnrf (
talk)
18:57, 30 August 2008 (UTC)
The title of this section is a regex bug I just found. I might come back and fix it myself later, but I'm simply noting it here for now. {{
Nihiltres|talk|
log}}
16:53, 3 September 2008 (UTC)
As far as I know, the Spanish word for "effect" is "efecto". Currently, the typo part of AWB is seeing "Efecto" and suggesting it be changed to "Effecto". I'm not sure if/how other languages are tied into the typo fixing, but we may want to remove this fix.--
Rockfang (
talk)
20:28, 5 September 2008 (UTC)
The idea with foreign text is to use the {{lang|es|efecto}} language tags, then the English typo fixes aren't applied to it.
Rjwilmsi20:39, 5 September 2008 (UTC)
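One way to picture how tagged foreign text can be kept away from the typo rules is sketched below. This is not AWB's actual implementation; the splitting approach, the `LANG_TEMPLATE` pattern and the `apply_outside_lang` helper are all assumptions made for illustration.

```python
import re

# Assumed pattern for a simple, non-nested {{lang|xx|text}} template.
LANG_TEMPLATE = re.compile(r"(\{\{lang\|[^{}]*\}\})")


def apply_outside_lang(rule, replacement, text):
    """Apply a typo rule only to the text outside {{lang}} templates."""
    parts = LANG_TEMPLATE.split(text)
    # With one capture group, even indexes are plain text and odd indexes
    # are the templates themselves, which are left untouched.
    for index in range(0, len(parts), 2):
        parts[index] = rule.sub(replacement, parts[index])
    return "".join(parts)


if __name__ == "__main__":
    rule = re.compile(r"\b([Ee])fecto\b")  # the kind of rule reported above
    print(apply_outside_lang(rule, r"\1ffecto", "Efecto and {{lang|es|efecto}}"))
```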
emminent currently corrects to eminent; sometimes it should become imminent instead; hypothetically, it could also be a mistaken immanent. Even though emminent is never correct, we might need to delete it. Or consider ways to eliminate the false fixes. "an emminent" might work to still catch some eminents with little chance of intending imminent, for example. --
JHunterJ (
talk)
16:34, 25 August 2008 (UTC)
There are quite a lot of typos that have had to be rejected for the RETF page, either because the correction isn't unambiguous (e.g. 'distict' could be a typo for 'district' or 'distinct'), or because it's valid in one context but not in another, e.g. 'Valparaiso' is correct when referring to
Valparaiso, Florida, but should be corrected to
Valparaíso when referring to the city in Chile.
I'd like to suggest an enhancement to AWB to help with situations like those. There would be a new 'Ambiguous Typos' list, much like the current 'Typos' list, with entries along the lines of
AWB would read this list and, on finding the RegEx value in an article, would present a panel much like the current link disambiguation panel, for the AWB user to select from the listed replace options.
I had heard that for whatever reason, the diacritics for names like this are being excluded from some pages (e.g.
Montreal Canadiens); I don't think AWB should be making the correction for the accent (despite that I think we should have the accents; consensus overrules my preference). {{
Nihiltres|talk|
log}}
12:44, 11 September 2008 (UTC)
Yes, there's an agreement over at
WP:HOCKEY that players' names don't show accents in the NHL context, because the NHL jerseys don't use them. But they're used in the player's own article. So AWB can't do it as a general fix.
Colonies Chris (
talk)
11:34, 12 September 2008 (UTC)
dispicable
"dispicable" should probably become "despicable", not "despairicable"
I didn't make the change anyway, as it was in quotes on the target page
It has recently come to my attention that AWB recommends a correction of Buddah to Buddha. This is a very problematic correction because of the famous record label,
Buddah Records, often shortened to just Buddah. There are probably a few hundred pages which mention Buddah Records, and because of this I'd like to ask that this correction be removed from the list.
Chubbles (
talk)
16:23, 13 September 2008 (UTC)
Graph looks like a good candidate for wholesale replacement, without trying to identify all the prefixes and suffixes. Can we fix any instance of "grpah", regardless of surrounding letters, or is there a false positive that that would hit? --
JHunterJ (
talk)
20:17, 16 September 2008 (UTC)
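A sketch of what such a wholesale rule could look like, in Python for illustration. The pattern and the name `fix_grpah` are assumptions, not a rule taken from the typo list, and the sketch cannot answer the open question of whether some legitimate word contains "grpah".

```python
import re

# Capture whatever surrounds the typo so the whole word is matched and can
# be shown in an edit summary, without listing prefixes or suffixes.
GRPAH = re.compile(r"(\w*)([Gg])rpah(\w*)")


def fix_grpah(text):
    return GRPAH.sub(r"\1\2raph\3", text)


if __name__ == "__main__":
    for sample in ["grpah", "Grpahs", "photogrpahy", "autobiogrpahical"]:
        print(sample, "->", fix_grpah(sample))
```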
Wonderful resource for spelling errors uncaught by AWB
Go look at
History of Ethiopia. I have already manually changed several spelling errors that AWB didn't catch (look at the diff of my AWB edit as well; I manually changed a few there as well). If you keep looking, you'll probably spot more.
Ling.Nut(
talk—
WP:3IAR)01:43, 20 September 2008 (UTC)
The ending "-(s)ible" incorrectly converts "passable/-ably/-ability" to "passible/-ibly/-ibility". I think it's fixable by replacing [Pp](?:[ao]s|lau) with [Pp](?:os|lau), but I'm not confident enough to do it. Am I close? —
SMALLJIM22:15, 4 October 2008 (UTC)
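The full "-(s)ible" rule is not quoted in this thread, so the sketch below wraps the two prefix alternations in an assumed surrounding pattern to show the effect of the proposed change. `CURRENT_RULE`, `PROPOSED_RULE` and `apply_rule` are hypothetical names, and Python stands in for AWB's regex engine.

```python
import re

# Assumed rule shape; only the prefix group comes from the comment above.
CURRENT_RULE = re.compile(r"\b([Pp](?:[ao]s|lau))sab(le|ly|ility)\b")
PROPOSED_RULE = re.compile(r"\b([Pp](?:os|lau))sab(le|ly|ility)\b")


def apply_rule(rule, text):
    return rule.sub(r"\1sib\2", text)


if __name__ == "__main__":
    for sample in ["passable", "possable", "plausably"]:
        print(sample, "->", apply_rule(CURRENT_RULE, sample), "/", apply_rule(PROPOSED_RULE, sample))
```

Under that assumption, dropping the "a" from the character class stops "passable" from being changed while "possable" and "plausably" are still corrected.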
Wouldn't using something like the above make more sense? I.e. do it all case-insensitively, so that it matches any of the variations (which saves having various hardcoded versions). -- Reedy 10:23, 5 October 2008 (UTC)
My Opera hangs for at least 10 seconds when loading the typos page. This is intolerable, let's take some measures to reduce it. I've already tweaked AWB to use Gzip compression when loading typos, but the list is still huge, and this has no effect on people who maintain or view the list from their browsers.
I've also dropped the requirement for word="foo" attribute to be present in the rules in the next version of AWB, so removing them all will somewhat reduce the size, but will make it harder to understand what a rule is supposed to do.
We could also replace those fancy <syntaxhighlight lang=""> tags with simple <pre> tags - that will reduce the size of the HTML output by about 225 bytes per rule, at the expense of a bit of human readability.
I wouldn't say it's intolerable. I would prefer to have a large, comprehensive list at the expense of a slow-ish load time than have some arbitrary size limit. There aren't many users who regularly load the page in browser so I'm not sure it's such an issue. Some changes have already reduced the size a little, any further changes to reduce it without cutting functionality/usability are welcome.
Rjwilmsi16:56, 4 October 2008 (UTC)
I agree, it's too big. I'm reluctant to edit it because loading and saving it - even just a section - takes so long. Could it be split alphabetically, or by category, and recombined by AWB when loading it?
Colonies Chris (
talk)
14:10, 7 October 2008 (UTC)
BTW, do you use the typoscan plugin? Try it if you don't (place the typoscan.dll next to autowikibrowser.exe, then edit as normal and look at the typos tab) – for one thing you can then use it to report which regex is wrong. Thanks
Rjwilmsi07:54, 27 October 2008 (UTC)
OK.. first my admission my OED is missing Vols VI, VII, and X so I couldn't check some other hyphenations.
But for "grand-" used as a familial relation quantifier my OED has:
Grand-aunt
Grand-dad and granddad
Grand-daughter
Grandfather
Grandmama
Grandmother
Grand-nephew
Grand-niece
Grandpapa
Grandparent
Grandpaternal
Grandsire
Grandson
Grand-uncle
For "sea" words where the idea of sea is part of the meaning, over 36 pages all compound words (mainly things like sea-fox) were hyphenated except where shown in the following list:
Seafaring and seafarer but sea-fare
Seaman
Sealess
Seamost
Seaport (but example contains hyphen)
Seaquake and sea-quake
Seaside and sea-side
Seaweed
Seaworthiness (but example contains hyphen)
Seaworthy
The following were split into two words
Sea air
Sea cucumber
Sea legs
Sea spider (but example contains hyphen)
This is, of course, not to say that other presentations of these words are wrong.
It is worth asking: What edition of the OED do you have? Hyphenation has most certainly not remained constant over the centuries. "Deluxe", for example, appears in some old texts as "de-luxe". --CrypticC62 ·
Talk04:05, 17 November 2008 (UTC)
Things that are the only one are invariably also the first: the phrase "first and only" has inexcusable redundancy and should read "only". But there are very occasional false positives:
First and Only (book), "first (and only the first)". Is this annoyance within the scope of AWB regexp magic? --
Tagishsimon(talk)23:43, 12 November 2008 (UTC)
I can create and add a rule for this tomorrow, but it might have to be removed again if there are too many false positives that can't be handled as exceptions within the rule.
Rjwilmsi00:09, 14 November 2008 (UTC)
Thank you; I'm very grateful. The need to remove it if too many FPs is well understood; it'll be interesting to hear if that need transpires. --
Tagishsimon(talk)01:04, 14 November 2008 (UTC)
I'm thinking about small wikis that might copy our rules and don't have the manpower or realize when something is no longer "only" to change it to "first". —
Dispenser19:58, 14 November 2008 (UTC)
Coming here from
this comment. I think many things can be the only without being first -- the thrift shop example I gave, or what if as in the example I cited at Rjwilmsi's page, GSH no longer performs that in the future but instead
Nyack Hospital does. Or if they both do, GSH is no longer the only, but was the first. To me I think it adds ambiguity. StarM16:04, 16 November 2008 (UTC)
I must say, I quite agree with StarM. Dilemma:
Poison xxx cannot be detected by current autopsy techniques. It is, in fact, the only poison that cannot be detected. If, however, there was, at some point in the past, a poison which could not be detected for a significant chunk of history, but which recent medical advancements have made entirely detectable,
poison xxx would be only, but not first. If, on the other hand, by some strange coincidence
poison xxx were the first poison of which we have accurate record of being undetectable, and that undetectability were to somehow remain true unto this day, it would be first and only. While I do agree that in many cases, having both words is often redundant, it seems obvious to me that, depending on the surrounding context, adding first may very well help to clarify things. An anti-redundancy task force may work in this case. AWB would not. --CrypticC62 ·
Talk04:03, 17 November 2008 (UTC)
Also, sometimes we might want to replace "is the first and only" with "was the first and remains the only". As in the case of two institutions, say, where the second one closes, so the first was not always the only. There is also the converse, where the first closes and the second remains the only: "second and only"!--
BillFlis (
talk)
12:42, 17 November 2008 (UTC)
Add me to those who disagree with this regex being there, though the false positive issue is a side matter. I don't think this phrase is necessarily "bad" all the time. I'm sure some people dislike the phrase because they consider it trite, but that's hardly reason to eliminate it entirely. It indicates that by default the event was supposed to be just the "first" one, but something went horribly wrong and it ended up being the "only" one as well. Bad usage example:
Eddie Gaedel's first, and only, at-bat was on August 19, 1951. Obviously just "only" suffices here. Decent usage example: Bob chaired the meeting for his first, and only, time on Saturday. (but it's later explained it was such a disgrace he got kicked off the council) This becomes even more clear if you think of examples where the adjective first has become part of the word: "His
First Communion was his only one."
SnowFire (
talk)
00:51, 24 November 2008 (UTC)
I have to disagree also. Although I think "first and only" is not the best phrase to use, it will often add clarity, and changing it to "only" may remove that clarity.
Darryl.matheson (
talk)
01:59, 24 November 2008 (UTC)
you shouldn't fix every false positive that exists
Hi, I just happened to see an edit summary " -> fix false positive: Aroud=name". I'm not picking on that particular edit or editor; please don't be offended. Just in general, you shouldn't fix every false positive that exists. "Aroud" may be a name, but I have never seen it before, and I doubt it's a common one. Let's check: I see 73 ghits for "Aroud" on Wikipedia. Nine of those are in Categories; non-editable. So 64 instances of "Aroud". I see about four that are probably names and about 60 that are errors.
I agree. If we could account for all the false positives, then we wouldn't need humans to check each typo fix.
What I've been wishing for is a way to flag text as not misspelled. It could be used for foreign words, things that are misspelled on purpose (I've seen a lot of band names and song titles like that), etc. However, I doubt it would be worth developing that, and things would probably get marked that shouldn't.
Auntof6 (
talk)
06:59, 1 December 2008 (UTC)
On review of the recent changes to the article list the only one I would say is incorrect is the change to "(Un)Official" as this removes a common typo when avoiding the false positives. The other changes are only to avoid capitalised names, which seems reasonable to me.
Rjwilmsi08:32, 1 December 2008 (UTC)
As the person who made all these changes, I must disagree. Every change I made avoids more false positives than actual errors. (I checked via a WP search before each change.) As Rjwilmsi pointed out, in the majority of the cases I excluded only the capitalized form of the word. That way nearly all legitimate misspellings are still caught, since the word will usually be lower case when not a proper noun. With regard to (Un)Official, the foreign language spelling "oficial" is quite common on Wikipedia. For example, of the first 50 WP search listings there are 44 legit spellings and 6 errors. Of the 6 errors, 4 were oficially, which will still be picked up by the separate rule for officially. --
ThaddeusB (
talk)
16:36, 2 December 2008 (UTC)
clarification
The typo page says "Although this project was started with the aim of 100% accuracy, the less accurate but more inclusive list we have now is better." I assume this is supposed to mean that a rule that detects some false positives is OK, as long as such matches are rare. Is that a correct interpretation, and what is a good "rule of thumb" as to what this actually means in practical terms? Thanks --
ThaddeusB (
talk)
19:25, 2 December 2008 (UTC)
Your interpretation seems about right. As for "a rule of thumb" I don't think it's really possible to provide one, as searching for a particular typo at any one time to compare false positives and genuine typos is not an accurate measure, because the number of genuine typos fluctuates as users fix errors and others introduce them. Therefore it is difficult to be any more precise than the existing wording you cite. I believe any false positives should be treated on a case-by-case basis to determine whether there are indeed too many of them.
Rjwilmsi19:33, 2 December 2008 (UTC)
That typos are underrepresented in any given search did occur to me, since they are constantly being corrected while correct spellings that show up as false positives will (hopefully) stay around. What I was trying to get at is whether there is a frequency of occurrence that would definitely rule a word out. For example, there are several hundred legit "oficial"s so I took it off, but you seemed to think this was a mistake on my part. On the other hand, "teh" is perhaps the most common typo of all, but isn't detected since it also has many legit uses. Sorry if I am being difficult, I am just trying to understand the logic used (if any ;) ). Thanks --
ThaddeusB (
talk)
20:19, 2 December 2008 (UTC)
abolish
I believe the correct noun for the word 'abolish' is 'abolition'. However, I have come across a few instances where 'abolishment' has been used.
Ohconfucius (
talk)
05:23, 3 December 2008 (UTC)
Both are correct and despite their similar form actually have different word origins. In other words, abolition is simply a synonym of abolishment, not a form of the word abolish. --
ThaddeusB (
talk)
05:48, 3 December 2008 (UTC)
word endings question
A number of the word-ending regexes are of the form "\b(\w+)[ending]\b". Unless I'm missing something, the "\b(\w+)" part will be true in every case except " ending ". I would think it would save a good deal of processor time if these were changed to simply "[ending]\b" (excluding cases where " ending " actually should be excluded). Am I missing something? --
ThaddeusB (
talk)
03:40, 3 December 2008 (UTC)
Yes, it's the edit summary displayed to other editors: consider the "-solutely" fix. As it stands, corrections will show 'typos fixed absolutly --> absolutely' etc., whereas the change you suggest would just show 'typos fixed solutly --> solutely'. A reviewer wouldn't be able to see what word was actually corrected.
Rjwilmsi08:22, 3 December 2008 (UTC)
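The difference in what a reviewer would see can be sketched as follows. This is an illustration in Python, not AWB code: the `summary` helper and the exact rule patterns are assumptions based on the "-solutely" example above.

```python
import re

# Rule in the current style: the leading \b(\w+) pulls the whole word into the match.
WHOLE_WORD = re.compile(r"\b(\w+)solutly\b")

# Shortened rule as proposed: only the ending is matched.
ENDING_ONLY = re.compile(r"solutly\b")


def summary(rule, replacement, text):
    """Build an edit-summary style string from the first match, or None."""
    match = rule.search(text)
    if match is None:
        return None
    return f"{match.group(0)} --> {match.expand(replacement)}"


if __name__ == "__main__":
    text = "It is absolutly fine"
    print(summary(WHOLE_WORD, r"\1solutely", text))
    print(summary(ENDING_ONLY, "solutely", text))
```

Both rules would produce the same corrected article text; only the matched span, and therefore the summary, differs.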
A quick aside for anyone who thinks these sorts of exclusions are silly: I found 3 "Balanciaga"s on Wikipedia and all 3 had been "fixed" from Balenciaga using AWB. One was even done by Reedy. (All fixed now.) Not blaming anyone, as it is natural to assume the regex knows what it is doing and not realize it was actually trying to fix a "version of balance." --
ThaddeusB (
talk)
04:41, 5 December 2008 (UTC)
This should probably be removed, all I have ever got from this rule is false positives on French articles (mostly related to wine, meh). —
neuro(talk)03:08, 6 December 2008 (UTC)
The French word should have an accent. According to dictionary.com département is considered valid in English as well. Therefore, I modified the old rule to avoid dropping the e (and still fix other department typos), but added a new one to add the accent.
[3] --
ThaddeusB (
talk)
03:46, 6 December 2008 (UTC)
OK, this was a really freaky error. The text showed as "included" on
Heribert of Cologne, however the character after the d was NOT exactly an e. If you copy and paste the text into Notepad it transforms into "includ-ed", which is (sort of) the way the regex saw it. I guess it is some weird Unicode character that looks exactly like an e, but doesn't behave nicely. I retyped 'included' on that page and saved
[4] and the problem was fixed. So strange.
P.S. I was only able to fix it because of the screenshot - without it I would've had no clue which article it came from and would have bashed my brains in trying to figure it out. ;) So thanks for that. Next time you make one, click on the typos tab first so it'll show the regex causing the problem to make it even easier to fix. (I had to load the article to get this info before I figured out it was a problem with the text, not the regexes.) --
ThaddeusB (
talk)
04:12, 6 December 2008 (UTC)
I have seen this sort of problem before -- I believe it is something to do with a non-Unicode character being pasted into the edit box at some point, which the typo rules will treat as a word boundary.
Rjwilmsi23:19, 10 December 2008 (UTC)
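The exact character involved is not identified in this thread. Assuming it was an invisible format character such as a soft hyphen (U+00AD), which is only a guess, a small helper like the hypothetical `find_format_characters` below could locate it in pasted text.

```python
import unicodedata


def find_format_characters(text):
    """Return (index, code point, name) for each invisible format character."""
    hits = []
    for index, char in enumerate(text):
        # Category "Cf" covers format characters such as the soft hyphen
        # and zero-width joiners, which are not word characters.
        if unicodedata.category(char) == "Cf":
            name = unicodedata.name(char, "UNKNOWN")
            hits.append((index, f"U+{ord(char):04X}", name))
    return hits


if __name__ == "__main__":
    print(find_format_characters("includ\u00aded"))
```

Because such a character is not a word character, a regex `\b` matches on either side of it, which would be consistent with a typo rule treating "ed" as a separate word.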
I use a regex that converts \[\[(.*?)\]\], \[\[(s1|s2|..|s50)\]\] (where the sn are the US states) to [[$1, $2]]. This fixes ambiguous link pairs such as [[Jackson]], [[Mississippi]] (the occasional false positive I just undo manually before saving). However, it fails badly when there is a list of states, e.g. [[Utah]], [[Nevada]] gets converted to [[Utah, Nevada]]. Is there a way to exclude cases where the first element is also a US state?
Colonies Chris (
talk)
23:36, 10 December 2008 (UTC)
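One possible approach is a negative look-ahead right after the opening brackets that rejects a first link which is itself a state. The sketch below is in Python with a three-state stand-in for the full list; `STATES`, `PAIR` and `merge_place_links` are hypothetical names, and the first group uses a stricter character class than the original `.*?` so a match cannot run across several links.

```python
import re

# Stand-in for the full list of 50 states.
STATES = ["Mississippi", "Nevada", "Utah"]
ALTERNATION = "|".join(re.escape(state) for state in STATES)

# (?!...) refuses a first link that is exactly a state name.
PAIR = re.compile(
    rf"\[\[(?!(?:{ALTERNATION})\]\])([^\[\]|]+)\]\], \[\[({ALTERNATION})\]\]"
)


def merge_place_links(text):
    return PAIR.sub(r"[[\1, \2]]", text)


if __name__ == "__main__":
    print(merge_place_links("[[Jackson]], [[Mississippi]]"))
    print(merge_place_links("[[Utah]], [[Nevada]]"))
```

A city that shares its name with a state would also be skipped by this check, so some manual review would still be needed.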
I'll acknowledge this one. I kept getting hits on extra-curricular on AWB/TypoScan, and actually started "fixing" a number of articles. US dictionaries show it as one word, extracurricular, but UK shows as two words, hyphenated. ♪BMWΔ18:03, 13 December 2008 (UTC)
I have removed this "fix" and the one for extra-marital.
OED lists both as valid. Thanks for drawing my attention to this. I have personally "fixed" several dozen "extra-curricular"s without realizing the error. --
ThaddeusB (
talk)
02:25, 14 December 2008 (UTC)
I just added extrajudicial, extramundane, extraordinary, extraposable, extraprovincial, extraterritorial to the extra- rule. All are one word (no hyphen) per
OED. I trust there are no complaints about these? --
ThaddeusB (
talk)
03:04, 14 December 2008 (UTC)
Fixed Hello, I have corrected the naught->nought "correction." It's ironic that someone made it since naught is the more common variation. FYI though, the two words aren't always interchangeable - certain senses require one word or the other. --
ThaddeusB (
talk)
17:26, 14 December 2008 (UTC)
opinion question: "lifeform"
I am considering adding a rule to correct "lifeform" which is not found in OED or Webster's Unabridged. It is, however, commonly written as one word here. OED lists the preferred spelling as "life-form" but also accepts "life form." So the question is should the word be corrected? If yes then correct it to "life-form" or "life form"? (Leaving both hyphenated and spaced words unchanged.) --
ThaddeusB (
talk)
21:07, 14 December 2008 (UTC)
We should definitely change the one-word version. Change it to the "preferred" form and, as noted, let the two-word version remain. ♪BMWΔ12:20, 15 December 2008 (UTC)
an historic → a historic
I just saw
this edit, and wanted to point out that in many regional pronunciations/dialects of English, the "h" in "history" is silent, so "an historic" would be appropriate and not a typo. This seems like a case of British vs. American English, and it would be like saying colour or colonise were typos. Just wanted to see if this "typo" could be removed from the list (if it hasn't been already). Thanks!-
Andrew c[talk]17:36, 15 December 2008 (UTC)
I eliminated the rule
[9] because there is no consensus on which is correct (a or an). Here is a link to some discussion on the subject for anyone interested:
[10]. Including this from Oxford: "... an was common in the 18th and 19th centuries, because the initial h was commonly not pronounced for these words. In standard modern English the norm is for the h to be pronounced in words like hotel and historical, and therefore, the indefinite article a is used; however, the older form, with the silent h and the indefinite article an, is still encountered, especially among older speakers." And the Google stats: 68% a/32% an.
Well, there is correctness and then there is consistency. Are you voting against the latter? Furthermore, does anybody, anywhere, really say "an history"?--
BillFlis (
talk)
00:08, 17 December 2008 (UTC)
It's not really up to me, you know. :) Clearly someone says "an history" or it would never appear in Wikipedia. :p The real question is: does any authority view it as correct? I seriously doubt any does. (Google usage stats are like 95% to 5%.) However, it does appear in old book titles and such (it once was correct). These cases *should* be in the form "An History..." so I don't see much problem with having a rule to match "(A|a)n history" --
ThaddeusB (
talk)
03:19, 17 December 2008 (UTC)
Hello, someone recently deleted the sea- rule. I have partially restored it to correct only a few of the previous cases where it should never be hyphenated in modern English. The following is from the archive:
For "sea" words where the idea of sea is part of the meaning, over 36 pages all compound words (mainly things like sea-fox) were hyphenated except where shown in the following list:
* Seafaring and seafarer but sea-fare
* Seaman
* Sealess
* Seamost
* Seaport (but example contains hyphen)
* Seaquake and sea-quake
* Seaside and sea-side
* Seaweed
* Seaworthiness (but example contains hyphen)
* Seaworthy
It's a good start, but the author missed a few that the OED lists unhyphenated: seaboard, seafood, seaplane, and seaward. The new rule corrects those 4, seaman, seaport, seaweed, and seaworthy, plus derivatives thereof. Most of the others in the old rule are hyphenated in the OED, but in no other major dictionary...
Words frequently used in close association tend to become unified in form as they are in meaning, and ultimately to acquire a single accent. There are three stages in the development of compounds. At first the components of the compound expression are written separately; next they are united by a hyphen; finally, when the separate significance and accent of these components have been lost sight of, they are combined into one word. The hyphenated stage may thus be considered merely preparatory to the coalescence of the various members into one word.
[11]
Considering the last OED was published in 1989, I think words like "seabird" and "seacoast" have since reached this final stage of development. (E.g., no other dictionary lists them as hyphenated.) I am confident that once current revisions of the OED reach S it too will list them as one word. However, until that happens (most likely within the next year or two) I think we have to accept both the hyphenated and unhyphenated forms.
Wait, is this the American Wikipedia or the British Wikipedia here? They're not the same language, you know. I think that would make a big difference in what stays and what goes. Do the Powers That Be recogni(s/z)e this?--
BillFlis (
talk)
00:21, 17 December 2008 (UTC)
(I'm pretty sure you already knew this, but for anyone else reading who might not have known...) Uh, this is the English Wikipedia. We recognize all forms of English (see
Wikipedia:ENGVAR). Hence we shouldn't auto-correct hyphenated forms to unhyphenated forms unless the word is written as such in ALL forms of English. There are times when such changes are appropriate (such as articles written about American subjects); however, it is not ALWAYS appropriate.
Also the OED doesn't track just British English, but rather "attempt[s] to record a word's most-known usages and variants in all varieties of English past and present, world-wide." (from
OED article.)
In conclusion, removing the hyphen in some cases is similar to (although not as obvious as and probably less controversial than) changing colour to color. --
ThaddeusB (
talk)
02:43, 17 December 2008 (UTC)
P.S. just for kicks, here is OED's variant section for recogni(s/z)e:
The Sc. means Scottish. The 5 means used in the 15th century, the 6 in the 16th and the 6- in the 16th century to present. Thus, both -ise and -ize are acceptable modern spellings. The headword line, however, simply reads "recognize, v.1" which is usually interpreted as what they view as "most correct" (sometimes they list multiple forms in the headword line such as "colour, color, n.1") but according to the
OED usage guide actually "shows the most common modern spelling of the word." (Per policy -ise/-ize words are listed under -ize headwords normally.) --
ThaddeusB (
talk)
03:10, 17 December 2008 (UTC)
Actually, the fruit shouldn't be corrected either since mongos is listed as correct in some dictionaries. The rule has been updated accordingly:
[12]. --
ThaddeusB (
talk)
18:49, 20 December 2008 (UTC)
Correction to "-goes"
I'm apparently not smart enough to figure out the right sequence of parentheses to correct this entry. Error says it has "not enough )'s".
This has been fixed – see entry above. AWB users need to 'refresh status/typos' on File menu to pick up the corrected typo list.
Rjwilmsi11:32, 27 December 2008 (UTC)
For some reason, the typofixer currently fixes "excatly" to "excately" which is really no better; the correct fix would probably be "exactly". I'm not quite sure how to track this down -- it doesn't seem to be the "(In)Exact" correction that's the problem. --
intgr[talk]13:32, 3 January 2009 (UTC)
What false positives would "could of" hit that "should of" and "would of" don't? I've been using that construct in my AWB for a while now, and "of course" is the only false positive I think I've hit. ("of necessity" is perhaps more often set off by commas.) --
JHunterJ (
talk)
02:13, 10 January 2009 (UTC)
I did a wiki search for all three. Between would of/should of there were around 10 "of necessity" matches - not much, but enough to add the exclusion. I didn't see any other FPs for those. "Could of" is the most common construct, but also generates a lot of false positives: [
[18]]. Most of the current matches are FPs in a variety of different constructs. Things like "replicate as much as he could of the observations", "preserved what it could of the professional capabilities", "taking measures to save what she could of the family property", and "relieved them as best she could of the filth". None of these examples are eloquent sentences, but neither do they use an "of" where a "have" should be used. --
ThaddeusB (
talk)
02:40, 10 January 2009 (UTC)
A new rule (or modification of the current one) could be made to correct specific cases - "could of been" for example. --
ThaddeusB (
talk)
02:43, 10 January 2009 (UTC)
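A narrow rule of the kind suggested above could look roughly like the following Python sketch. The pattern and the function name `fix_could_of` are illustrative assumptions, not the rule that was actually added to the list, and Python's `\1` stands in for the `$1` used in the typo list.

```python
import re

# Hypothetical narrow rule: only touch "could of" when "been" follows,
# so phrases like "what she could of the family property" are left alone.
COULD_OF_BEEN = re.compile(r"\b([Cc])ould of been\b")

def fix_could_of(text: str) -> str:
    """Replace 'could of been' with 'could have been', preserving case."""
    return COULD_OF_BEEN.sub(r"\1ould have been", text)

print(fix_could_of("It could of been worse."))
print(fix_could_of("She saved what she could of the family property."))
```

The second sentence passes through unchanged because the rule requires the following word, which is the point of restricting it to specific cases.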
I don't see any reason why the first regex would take so long - might have just been a fluke. How did you get the numbers? (I'd be interested in testing some regexes myself.) --
ThaddeusB (
talk)
14:52, 11 January 2009 (UTC)
AWB has a profiling option for the regextypofix under the File menu. Can't actually remember if it's only enabled in debug mode... Will check when I'm back on my main desktop later on today. —Reedy16:14, 11 January 2009 (UTC)
If you don't mind, enabling it would be nice. I don't really know anything about how to get different builds and such. I'm sure I could figure it out... but I'd rather not bother if I don't have to. :) --
ThaddeusB (
talk)
00:45, 12 January 2009 (UTC)
laborious / labourious
How does AWB cope with differing spellings on both sides of the Atlantic? I ask this because several editors have visited
Barry Island Pleasure Park and changed the correctly spelled 'labouriously' to 'laboriously'. Is there any way of preventing these edits? Thanks.
21stCenturyGreenstuff (
talk)
22:01, 11 January 2009 (UTC)
AWB should not change any correct spelling (whether American or British). In this case, labouriously would indeed seem to be an incorrect spelling with the correct one being laboriously (based on laborious), according to my UK English dictionary, which also lists labour. Do you have any reference for labouriously being the correct spelling?
mattbr22:27, 11 January 2009 (UTC)
The Oxford English Dictionary does NOT list labourious at all in its definition for laborious. It doesn't list it as an alternative spelling or an archaic spelling, nor do any of the examples use that spelling. It also is not in Webster's unabridged international dictionary. Per above, British news organizations only rarely let it slip through editing. All of this very strongly implies it is a non-standard spelling of the word. Of course, English has no official authority to say it's "wrong" per se, but neither is there one to say "labuorious", "laboorious", or "laborius" is wrong either.
A few might accept "labourious", but a few also accept "alot." IMO, Wikipedia should correct to standard spelling when one can be determined. Therefore, I think this correction is perfectly reasonable. --
ThaddeusB (
talk)
00:19, 12 January 2009 (UTC)
Incidentally, the word most probably derives either from the French "laborieux" or the Latin "labōriōsus" and not from "labour" (which itself came from old French "labour"), which would explain the spelling. --
ThaddeusB (
talk)
01:26, 12 January 2009 (UTC)
Word up: feature in next release: Don't fix a typo if the word is in the article title
I remember a while ago there was a good suggestion that a typo shouldn't be applied if it's also a word in the article's title, as this was a source of many false positives that could be avoided on surnames and archaic/unusual spellings etc. Well, using my new powers as an AWB developer I've implemented this feature in the SVN version. If all goes well the feature will be in the next AWB release. Note, other typos will still be applied to the article if they don't match the title. Thanks
Rjwilmsi00:03, 14 January 2009 (UTC)
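The idea behind this feature can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the AWB implementation: the function name `apply_typo_rules`, the two sample rules, and the choice to compare matched text case-sensitively against whole title words are all hypothetical.

```python
import re

# Hypothetical sample rules as (pattern, replacement) pairs in Python syntax.
RULES = [
    (r"\b([Hh])omberg\b", r"\1omburg"),
    (r"\b([Rr])eciev", r"\1eceiv"),
]

def apply_typo_rules(title: str, text: str, rules) -> str:
    """Apply each rule, but skip any match that is also a word in the title."""
    title_words = set(re.findall(r"\w+", title))
    for pattern, replacement in rules:
        def repl(match, replacement=replacement):
            if match.group(0) in title_words:
                return match.group(0)          # leave the title word alone
            return match.expand(replacement)   # otherwise apply the fix
        text = re.sub(pattern, repl, text)
    return text

print(apply_typo_rules("Homberg (Efze)", "Homberg did recieve a charter.", RULES))
```

In this sketch "Homberg" is kept because it appears in the title, while the unrelated "recieve" fix is still applied, matching the note that other typos are still fixed.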
AWB seems to be replacing Honourary with Honorary. Considering the rules on US/Commonwealth spelling could someone remove this?
Ironholds (
talk)
00:21, 3 January 2009 (UTC)
Unless I'm being thick, neither of those actually make that change (even with my minor tweak to the 2nd one). Another rule? —Reedy00:24, 3 January 2009 (UTC)
Never mind, didn't notice a fix was made. I am looking at edits from about 12 hours ago so it's probably not a problem now. -
Djsasso (
talk)
15:43, 3 January 2009 (UTC)
To add to the above, the full OED entry for "honorary" lists "honourary" as a derivative spelling that went out of use in the 19th century. No other major dictionary lists that spelling of the word. Therefore, IMO, it is both accurate and desirable to change honourary to honorary. --
ThaddeusB (
talk)
19:45, 3 January 2009 (UTC)
Conclusion: major UK news organizations prefer honorary, but sometimes let honourary slip through. 97:1 usage overall, compared to about 30:1 for honour vs. honor.--
ThaddeusB (
talk)
20:11, 3 January 2009 (UTC)
This may also be of interest; the OUP regards it as a spelling error. It may also be a form of
hypercorrection, as it looks like it might be a US/UK English variant, even though it isn't. --
John (
talk)
20:31, 3 January 2009 (UTC)
I love playing with regular expressions. It's fun fun fun. But.. this page.. seems to have become The Regular Expression Game. I think it needs to be shut down. Not deleted... its contents are largely very useful.. but frozen. Folks have been adding pointless fixes for a very long time. They are adding things.. not because the things need to be added.. but because they can add them. And because it's fun to do so.
Ling.Nut(
talk—
WP:3IAR)03:02, 10 January 2009 (UTC)
Do you have something specific in mind when you say "pointless fixes"? I've made a good chunk of the recent additions and I stand by every one as being useful. Perhaps you think it is pointless to change "Christmas day" into "Christmas Day", for example? Or "a American" into "an American"? As far as I'm concerned these are just as legitimate as a spelling correction. If the spelling/grammar/capitalization/punctuation is wrong, it is wrong. If one is scanning the page anyway why not fix every error you can? An edited, published text would certainly fix these sorts of things - why shouldn't we? --
ThaddeusB (
talk)
03:20, 10 January 2009 (UTC)
Question from someone with a programming background, but not this kind of programming: At what point do additional entries become a performance issue? As long as 1) it's not too much of a performance issue, 2) the people doing the maintenance can cope with the size, and 3) we aren't trying to change "a xxx" into "an xxx" where "xxx" is every noun that starts with a vowel (and the reverse case), what's the problem? --
Auntof6 (
talk)
05:38, 10 January 2009 (UTC)
I started essentially the same thread back on 1 Dec of last year. You guys need to realize that this page is the equivalent of a header file, and headers should be far more stable. This whole page should also be subject to a top-to-bottom, item by item review to see how many items are truly worthwhile.
Ling.Nut(
talk—
WP:3IAR)05:41, 10 January 2009 (UTC)
I guess that depends entirely on what "worthwhile" means. Does it mean that the error is common? If so, that can be hard to determine because if a user goes through and changes all instances of "some error" that error will appear to be non-existent until another user makes it again. I'm currently working on gathering word-by-word statistics on the last db dump (which is pretty out of date). Maybe that will help determine "useless" rules when I'm done.
Not all expressions require equal processor effort. For example, the endings section tends to be especially taxing since they match EVERY word for a while only to be rejected at the end. This should probably be taken into account - BUT our guidelines call for combining words into generic patterns when possible. So these types of rules are in a way both encouraged and discouraged. Hmmm....
We do have a guideline that says "remove uncommon errors", which I have done on occasion. I will personally keep a closer eye out for rare errors in the future. --
ThaddeusB (
talk)
06:24, 10 January 2009 (UTC)
Yep, I'm having fun adding things. Wikipedia is being improved (more typos are being fixed), so the fixes aren't pointless. I don't think of this as a header file, more like an INI file or configuration file - meant to be tweaked. So unless there's some revelation of an actual problem, no, I don't think it needs either calming or shutting. --
JHunterJ (
talk)
13:45, 10 January 2009 (UTC)
See below. Only when a regex's execution time is something like 60 times more than the next regex should it be disabled. —Reedy13:49, 11 January 2009 (UTC)
This was a specific request by someone and we already have several similar rules to correct accents on people's names. I really fail to see a problem. I am not specifically seeking out names to correct, but if someone points one out, I fail to see a problem with correcting it. --
ThaddeusB (
talk)
17:24, 25 January 2009 (UTC)
I review every new rule before adding it to make sure that the error actually exists in wikipedia (by searching for the erroneous spelling). Also, I occasionally review previously added rules the same way, and, if the error does not exist, I delete the rule. There has also been a lot of progress in consolidating rules, many of which have been moved to the "Beginnings" and "Endings" sections. And sure, it's fun; you got a problem with that? Why are you here?--
BillFlis (
talk)
20:17, 25 January 2009 (UTC)
(undent) No, nothing is wrong with having fun. Are there performance issues involved? People with slow Internet connections having long waits for info to download? What about performance issues in checking each page against such a heavy set of regexes? Just on the face of it, it makes no sense at all: every page I spellcheck will check for "Kveta Peschke"? I've done thousands of pages, on one occasion, and may do so again...
Ling.Nut(
talk—
WP:3IAR)09:44, 26 January 2009 (UTC)
bug: capitalizes every instance of words beginning with "th"
I just added the RegExp button to wikEd and tried it out. It capitalized every word beginning with "th" - the, them, their, etc. Obviously a bug since most aren't the first words of sentences. I didn't commit these changes, of course. -
Armchair info guy (
talk)
03:08, 19 January 2009 (UTC)
I'd certainly think this is a bug related to wikEd - this certainly doesn't happen in AWB. The rule in question is likely <Typo word="The" find="\b[Tt]He(n?|irs?|re|se|y)\b" replace="The$1" /> which should only capitalize things like tHe and THere. I'm guessing it is matching case insensitively, which is going to mess up a number of other rules as well. I'm not familiar with the software, but perhaps this is an option that can be toggled? --
ThaddeusB (
talk)
03:56, 19 January 2009 (UTC)
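The guess about case-insensitive matching can be checked with a small Python sketch. The `find` pattern is copied from the rule quoted above; the replacement uses Python's `\1` in place of `$1`, and the function name `apply_rule` and its `ignore_case` switch are illustrative assumptions, not part of AWB or wikEd.

```python
import re

# Pattern taken from the quoted rule; "\1" is Python's spelling of "$1".
FIND = r"\b[Tt]He(n?|irs?|re|se|y)\b"
REPLACE = r"The\1"

def apply_rule(text: str, ignore_case: bool = False) -> str:
    """Apply the 'The' rule, optionally with case-insensitive matching."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.sub(FIND, REPLACE, text, flags=flags)

sample = "tHe dog saw their cat, then the bird"
print(apply_rule(sample))                    # only the mis-cased "tHe" changes
print(apply_rule(sample, ignore_case=True))  # "their", "then", "the" change too
```

With the flag set, the literal `H` in the pattern also matches `h`, so ordinary lower-case words are capitalized, which is consistent with the behaviour reported.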
AWB is currently suggesting that restauranteur be changed to restaurateur. At least according to
Wiktionary, both appear to be valid spellings of the word. It may be prudent to remove the word change from the list.--
Rockfang (
talk)
17:19, 29 January 2009 (UTC)
This edit changed "cataloged" to "catalogued", which I think should not be done, because both spellings are acceptable. However, I can't find the rule that would have made the change. Can anybody find the rule, and either fix it (if this was a false positive for a rule that has a legitimate purpose) or remove it? —
AlanBarrett (
talk)
09:13, 31 January 2009 (UTC)
Done New rule
added for first problem, second was just non-printing character in middle of word in article, no change to typo rules needed for it. Thanks
Rjwilmsi08:15, 16 February 2009 (UTC)
I've seen AWB catch this typo frequently, however, the solution has never been to make that change. It has always been to just remove the first is (aka "is is" to "is"). Any way that can be fixed? --
Kbdank7120:23, 25 March 2009 (UTC)
Perhaps I wasn't clear. I meant to request undoing that change, as every time I've come across "is is", the correct typo fix is to just drop one is. I have never encountered a situation when "it is" was the correct solution. --
Kbdank7115:23, 27 March 2009 (UTC)
Two common misspellings I've come across that are not in the list
Firstly there's "enoble" (91 article hits), which should be "ennoble".
Then there's "meterorite" which should be "meteorite" however I'm not so sure about this one, it could just be an American/British thing.
I'm very unfamiliar with how to add these, I haven't learned the proper rules/expressions yet and don't want to screw it up so can someone add these please? -- OlEnglish(
Talk)23:22, 27 March 2009 (UTC)
"meterorite" gets only one hit, a redirect to the article with the correct spelling. I think it ought not to have been added.--
BillFlis (
talk)
19:46, 2 April 2009 (UTC)
Does AWB typos extend to dealing with unnecessary phrases such as "Sadly passed" (6,791 hits) "Passed away" (65,434) "sadly passed away" (4,909) and "Sadly died" (6,986), the vast majority of which really want to say "died"?
I keep my own pet-peeve wordy phrases in my replacement list in AWB, but mostly they're not in AWB Typos unless they're wrong (as opposed to just verbose or over-written). --
JHunterJ (
talk)
20:35, 26 March 2009 (UTC)
I removed the passed away additions. As I mentioned, there is nothing incorrect about saying someone passed away; since it isn't incorrect, it can't be corrected. --
JHunterJ (
talk)
23:28, 27 March 2009 (UTC)
I think your removals are not justified by your explanation: "since it isn't incorrect, it can't be corrected". There is clear guideline support for doing away with the death euphemisms, above, in
Wikipedia:Words to avoid#Death and dying. Your "is not typo fixing" does not seem to mesh with
Wikipedia:AutoWikiBrowser/Typos#Incorrect phrases. Like the person who added the phrases, like
User:BillFlis, who probably knows his way around this place with his 2,700 odd contributions, I would wish to keep these. Perhaps you would consider reinstating them. --
Tagishsimon(talk)00:11, 2 April 2009 (UTC)
For what it's worth, I agree that at least some of these "corrections" are appropriate. While it may not be technically incorrect to say 'passed away' it is against the style guide, which is a good enough reason to change it as far as I'm concerned. After all, many of our corrections already in use aren't, strictly speaking, "typo fixes."
I would tentatively support the following changes, but likely no others (as I feel other phrases may lead to undesirable changes). However, I could be persuaded against them if they are shown to cause false positives/undesirable changes.
"passed away" (all lower case only) -> "died"
"gave his(/her) life" -> "died"
"died tragically" / "tragically died" -> "died"
If the typo fixing rules can be used to assist in compliance to the agreed style guides then let's do it. Though as ThaddeusB says, if there are too many false positives we might have to remove or restrict the entries just like with any other typo rule.
Rjwilmsi11:06, 2 April 2009 (UTC)
I was not aware that the style guide covered them. Ones that are covered by a WP style guide and avoid false-positive problems, yes, I (no longer) have any objection to them. --
JHunterJ (
talk)
11:35, 2 April 2009 (UTC)
I think that "gave his/her life" has too many possible false positives. A quick search shows it's being used in at least two other contexts: devotion to religion (e.g., "gave his life to Jesus"), and "gave his life new direction". —
TKD::{talk}12:53, 2 April 2009 (UTC)
Homberg changed to Homburg - except there is a place called Homberg
One of the typo fixes changes "Homberg" to "Homburg" (it's buried under endings, search for word="-burg"). Fair enough most of the time, except there is a place called Homberg, see
Homberg (Efze). I assumed it was the correct Anglicisation of a German word, but now I suspect it is not.
Mr Stephen (
talk)
23:24, 9 April 2009 (UTC)
Is there a way to tweak the corrections for "answer", so that it doesn't systematically suggest to correct the above, e.g. on
Maharana Pratap Sagar? Generally it's used in one of the species of
Anser (genus)#Living species and taxonomy. -- User:Docu
If née is not preferred (I still think it could be preferred), then we should leave the rule so that it fixes incorrect accenting (e.g., neé) or remove the rule entirely? --
JHunterJ (
talk)
11:02, 13 April 2009 (UTC)
Suggestion for large-scale addition to the typos list
There are many redirects from titles without diacritics to the correct article title, with diacritics - e.g.
Jerome Bonaparte,
Brunswick-Luneburg. I believe it would be possible to use these redirects to set up regexes to automatically add the missing diacritics wherever the non-diacritic version is used (but I don't have the skills to do it). Here's how I think it could be done:
Is it possible to have the AutoWikiBrowser detect double hyphens between letters (such as "abc--xyz", or spaced like "abc -- xyz") and replace them with correct
em dashes? (see also
MOS:EMDASH) --
bender235 (
talk)
22:20, 17 April 2009 (UTC)
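A rough Python sketch of the detection asked about here is shown below. The pattern, the name `fix_double_hyphens`, and the choice to emit an unspaced em dash (written as `\u2014`) are assumptions for illustration; restricting the match to hyphens between letters is one way to leave HTML comments such as `<!-- ... -->` untouched.

```python
import re

# Two hyphens (optionally surrounded by single spaces) between two letters.
DOUBLE_HYPHEN = re.compile(r"(?<=[A-Za-z]) ?-- ?(?=[A-Za-z])")

def fix_double_hyphens(text: str) -> str:
    """Replace letter--letter double hyphens with an unspaced em dash."""
    return DOUBLE_HYPHEN.sub("\u2014", text)

print(fix_double_hyphens("abc--xyz and abc -- xyz"))
print(fix_double_hyphens("<!-- a comment -->"))  # left unchanged
```

Runs of three or more hyphens are also left alone, since the character after the second hyphen must be a letter or a space followed by a letter.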
I had a couple of false positives for Welsh place-names when using AWB earlier - it wanted to turn Aberaeron to Aberraeron and Aberafon to Aberrafon. In both cases, the existing spelling is correct. —
Tivedshambo (
t/
c)
22:18, 18 April 2009 (UTC)
I expanded communicate to match telecommunicate cases
here (I assume this is what you wanted done). Although, it won't actually match your example since telecommunications actually has one 'l' not two :) --
ThaddeusB (
talk)
03:16, 21 April 2009 (UTC)
Y[29] You are correct 'discernable' is listed in several dictionaries, and thus should probably not be corrected. Interestingly, 'indiscernable' is listed in none. Thus, I left the correction for indiscernable cases only. --
ThaddeusB (
talk)
03:29, 21 April 2009 (UTC)
Oftentimes in athletes' infoboxes there are things like "2x National Champion" or "4x Most Valuable Player". But it should be "2× ..." or "4× ...", respectively, using the
multiplication sign. --
bender235 (
talk)
08:46, 7 April 2009 (UTC)
I've been reverted when making that kind of change on sports pages, because of other editors' preference for the ASCII representation x. --
JHunterJ (
talk)
11:34, 7 April 2009 (UTC)
Where and why? Don't we replace - with – as well, because "p. 12-15" would be wrong (and "p. 12–15" correct)? --
bender235 (
talk)
12:47, 7 April 2009 (UTC)
Hello, you (or, at least, the AWB bot) have been treating "none the less" (three words) as a typo, and changing it to nonetheless (one word).
Most dictionaries say it can be either. The Oxford Dictionary for Writers and Editors (ODWE), which I have always gone to when in doubt, says the three-word version is actually to be preferred (unlike "nevertheless", which is always one word).
It's a very small matter in the great scheme of things, but I think at the very least there is no need to change "none the less" when it appears as three words.
Alarics (
talk)
20:15, 21 April 2009 (UTC)
I don't know whether this should be added as a "general fixes" request, but misspelled fractions like "1/2" or "3/4" should be replaced with ½ and ¾, respectively. That would include ½, ⅓, ⅔, ¼, ¾, ⅛, ⅜, ⅝, and ⅞. --
bender235 (
talk)
16:46, 27 April 2009 (UTC)
1/2 isn't misspelled, but I get your point. There is the possibility for many false positives this way, though, in dates, military unit designations, etc. etc. --
JHunterJ (
talk)
16:59, 27 April 2009 (UTC)
I think we have a guideline NOT to replace these with the Unicode characters somewhere, instead we should use upper/lowercase.
Cacycle (
talk)
12:20, 28 April 2009 (UTC)
It is definitely not a typo fix and probably not appropriate as a general fix either since "1/2" can mean a lot more things than just "one half". --
ThaddeusB (
talk)
00:31, 30 April 2009 (UTC)
If somebody can come up with an extremely reliable set of cases where fractions could be replaced then AWB could do it as a new general fix, otherwise, I think this can't go anywhere.
Rjwilmsi11:26, 30 April 2009 (UTC)
The word in French should be cast within a {{lang}} template, which will enclose it within a span identifying the language and protect it from automatic English-language fixes on the English-language projects. I don't think we wish to remove all strings that are words in other languages. --
JHunterJ (
talk)
00:42, 30 April 2009 (UTC)
I agree. In some cases it could be a misspelling of the English word -- this is, after all, the English Wikipedia. Besides, this kind of thing is the reason that AWB changes are supposed to be checked by a human before being saved. --
Auntof6 (
talk)
05:00, 30 April 2009 (UTC)
fourtunate
Currently this corrects to ffortunate. Not sure if it's worth fixing. -- User:Docu
"Corrected"
here to References of revenue, which is nonsense. This is the second time this has happened; is there some way to encourage AWBers to look before they edit? Can the article be templated to be left alone?
SeptentrionalisPMAnderson22:39, 6 May 2009 (UTC)
I know - what I mean is I loaded the page in my current AWB and it didn't try to make the correction. Presumably, this means the "fix" was taken out or fixed to only match "==Sources==" and not "==Sources XXX==" at some point. I would have to guess that the user who made the change is using an older version or something. --
ThaddeusB (
talk)
03:22, 7 May 2009 (UTC)
I suggest you contact the user who made the edit to ask them why it happened. It is not caused by any core AWB functionality.
Rjwilmsi06:44, 7 May 2009 (UTC)
Ah, I see you already have. The user in question just needs to improve their logic to make sure 'sources' is the entire text of the heading, rather than just the start of it.
Rjwilmsi06:48, 7 May 2009 (UTC)
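The difference between matching the start of a heading and matching the whole heading can be sketched in Python as follows. Both patterns and the "References" replacement are hypothetical reconstructions for illustration; the user's actual custom logic is not shown in this discussion.

```python
import re

# Loose: fires on any level-2 heading that merely starts with "Sources".
LOOSE = re.compile(r"^==[ \t]*Sources", re.MULTILINE)
# Strict: "Sources" must be the entire heading text.
STRICT = re.compile(r"^==[ \t]*Sources[ \t]*==[ \t]*$", re.MULTILINE)

def rename_loose(text: str) -> str:
    return LOOSE.sub("==References", text)

def rename_strict(text: str) -> str:
    return STRICT.sub("==References==", text)

page = "==Sources of revenue==\nText\n==Sources==\nMore text"
print(rename_loose(page))   # also mangles "Sources of revenue"
print(rename_strict(page))  # only the bare "Sources" heading changes
```

Anchoring with `^` and `$` in multiline mode is what restricts the strict version to headings whose entire text is "Sources".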
&nbsp; before units
I can't see a FAQ around here so... Why is AWB replacing spaces with &nbsp; before units? E.g. "12 mm" to "12&nbsp;mm"?
··gracefool☺15:28, 10 May 2009 (UTC)
So that the unit description doesn't fall on the next line; it will always be right next to the unit value. –
xenotalk15:34, 10 May 2009 (UTC)
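For readers unfamiliar with the fix, here is a minimal Python sketch of the kind of replacement being described. The unit list and the name `add_nbsp` are illustrative assumptions; AWB's own general fix covers its own set of units and exclusions.

```python
import re

# Hypothetical short unit list; a digit, one plain space, then a whole unit.
UNIT_SPACE = re.compile(r"(\d) (mm|cm|km|kg)\b")

def add_nbsp(text: str) -> str:
    """Swap the plain space between a number and its unit for &nbsp;."""
    return UNIT_SPACE.sub(r"\1&nbsp;\2", text)

print(add_nbsp("The bolt is 12 mm wide."))
print(add_nbsp("It took 12 months."))  # not a unit, left unchanged
```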
We do womens => women's and childrens => children's; should we also correct mens? (And maybe oxens, vixens, and sheeps?) RichFarmbrough, 10:24 12 May 2009 (UTC).
Unfortunately, it looks like these errors are ambiguous in that half are incorrect plural forms and half are incorrect possessive forms. Thus a typo rule is probably not ideal. --
ThaddeusB (
talk)
14:35, 12 May 2009 (UTC)
The problem with this is some people are running AWB "skip if no typo fix" and then this non-visible change would be considered a typo fix, effectively causing them to break the rule against insignificant edits. –
xenotalk15:38, 10 May 2009 (UTC)
Are there many other non-visible changes like this? If so, we could make a new "skip non-visible changes" checkbox...
··gracefool☺16:28, 10 May 2009 (UTC)
This change is against MOS (unless it has changed since I last read) since we don't endorse one system of spacing over another (2 spaces is standard in American English). Also, it is completely pointless since most browsers compress multiple spaces into one. --
ThaddeusB (
talk)
17:01, 10 May 2009 (UTC)
It won't make a visible change, and is potentially controversial (though given it's not a visible change that seems a contradiction...), so doesn't seem worthwhile.
Rjwilmsi17:21, 10 May 2009 (UTC)
Indeed many of us prefer the double space after a full-stop even though it doesn't show. RichFarmbrough, 10:18 12 May 2009 (UTC).
MOS says there is no guideline because it doesn't matter. But obviously it shouldn't be done by itself since that would be breaking the rule against insignificant edits.
··gracefool☺05:39, 13 May 2009 (UTC)
The supposed "rule" of two spaces after sentence-ending punctuation is not standard "American English", whatever that means. It is a hold-over from the bygone days of typewriters, with their (generally) non-proportional fonts. In type-set text, one space has always been the standard (see, e.g., U.S. Government Printing Office Style Manual, 1973, p. 11: "To conform with trade practice, a single justification space (close spacing) will be used between sentences.").--
BillFlis (
talk)
18:10, 17 June 2009 (UTC)
We didn't forget. Is there a Wikipedia style guideline for opting for ae? Should we use æ instead? Should we remove "archeology" from
archaeology? --
JHunterJ (
talk)
00:35, 26 June 2009 (UTC)
There are some typos where Hindenberg should be fixed to Hindenburg. But there is also
Basil Cameron, known as "Basil George Cameron Hindenberg" or "Basil Hindenberg". I don't know how to avoid those false positives. I think an extra rule for Basil will not help as we can't be sure of the order the rules are applied.--
ospalh (
talk)
08:45, 29 June 2009 (UTC)
I don't think there is a regexp to determine which Hindenbergs should be changed and which shouldn't. Both spellings appear to be valid surnames, and people are often referred to by just their surname in article bodies. --
JHunterJ (
talk)
11:19, 29 June 2009 (UTC)
I just used the regexp on its own and most "Hindenberg"s needed to be changed. But there were a few that had to stay. (Most of those did, in a way, mean Hindenburg, too, but were quotes or file names.) So in the end it's too complicated for an automatic rule and should probably not be included.--
ospalh (
talk)
14:27, 29 June 2009 (UTC)
AWB's reg exp typo tab should tell you which regexp was "hit" for this one. In this case, I suspect -olgy --> -ology as a general suffix hit. I don't think it needs to be changed, although possibly an earlier "trilogy" rule that catches "trilolgy" could be added. Since it appears that you
fixed the only instance of "trilolgy" on Wikipedia, I don't think a change is needed. --
JHunterJ (
talk)
15:51, 5 July 2009 (UTC)
Okay, thank you for the link. I'm confused as to why this happened, because for me no typo fixes are applied to the article as I would expect, since the "XBOX" under question is within a template, so is ignored by AWB when applying typo corrections. I can only suppose that the user who made the edit has some customised logic running on AWB that does not implement this standard restriction.
Rjwilmsi22:10, 17 July 2009 (UTC)
I think I figured it out. One of the templates earlier in the article wasn't closed, so it may have voided something. I don't know... Either way thank you for taking note.
BOVINEBOY200822:14, 17 July 2009 (UTC)
Moiré/moire
I'm not sure that the rule for moiré should be kept. I think I've found a false positive:
Moire (fabric). It's not strictly an error to spell the fabric with an accent, but apparently not standard.--
ospalh (
talk)
07:51, 20 July 2009 (UTC)
Hello! I'm trying to write a regex that does not match inside links or templates.
The example is:
The string is: " a [[ b ]] c d [[ d ]] [[c]] "
The match should be only the c outside the links (the bolded one).
Thanks for helping--
Zorlot (
talk)
04:17, 25 July 2009 (UTC)
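One common approach to this question is a negative lookahead that rejects a match when a closing `]]` follows without an intervening bracket. The Python sketch below is an illustration only: the helper name `find_outside_links` is hypothetical, and the pattern assumes links are not nested and are properly closed; templates would need a similar check for `}}`.

```python
import re

def find_outside_links(word: str, text: str) -> list:
    """Return start offsets of `word` occurrences that are not inside [[...]]."""
    # Reject a match if "]]" can be reached without crossing "[" or "]".
    pattern = re.compile(r"\b" + re.escape(word) + r"\b(?![^\[\]]*\]\])")
    return [m.start() for m in pattern.finditer(text)]

text = " a [[ b ]] c d [[ d ]] [[c]] "
print(find_outside_links("c", text))  # only the free-standing c
print(find_outside_links("d", text))  # only the free-standing d
```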
I use
autoed when I edit, and one of the edits it recommends a lot is deleting the space that is often between the "==" (of the header) and the actual section name.
The proper format for a section header is ==sectionname==,
NOT: ==(space)sectionname(space)==
so this is essentially two rules (as there needs to be one rule for the two equal signs on either side of the page)
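As a sketch of what such a fix might look like, the Python below handles both sides of the heading in a single pattern by using a backreference to the opening run of equals signs, rather than two separate rules. The pattern and the name `tighten_headings` are assumptions for illustration and are not taken from AutoEd or AWB.

```python
import re

# Opening "==" run, optional spaces, heading text, optional spaces, same run.
HEADING = re.compile(r"^(={2,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", re.MULTILINE)

def tighten_headings(text: str) -> str:
    """Remove spaces between the equals signs and the section name."""
    return HEADING.sub(r"\1\2\1", text)

print(tighten_headings("== History ==\nSome text\n=== Early life ==="))
```

Lines that are not headings, and headings that are already tight, pass through unchanged.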
This was the application of a suffix rule. (There's a tab in AWB that will display the rules that had matches on the current page; it can be helpful to include that info.) But I don't think it's a prevalent-enough typo to need a separate fix. -- you fixed the only occurrence. --
JHunterJ (
talk)
12:10, 8 August 2009 (UTC)
Hmm, that would be fine by me. I don't think terms like these should be replaced among all the typos (which means that people won't really think about whether "passed away" might be appropriate after all in some situations). --
Conti|
✉19:40, 13 August 2009 (UTC)
Conti, I reverted your removal of "passed away", but I think your argument has merit and should be discussed. I remember the first time I saw the plugin change pass away to die and I was pretty surprised. After all, like you and xeno said, "pass away" isn't a typo. The reason I was so quick to revert your change is that I felt removing an entry like that needed some discussion first, and in the meantime, you can just do what I do: ignore the change that AWB wants to make when it comes across "pass away" in an article.
-shirulashem(talk)19:46, 13 August 2009 (UTC)
Well, if we're all in agreement, what about making that change, then? :) Usually this list is only used for things that are blatantly wrong, and so far I only had to cancel a change because this list wasn't perfect, not because I disagree with it. And I'd like it to stay that way. --
Conti|
✉20:03, 13 August 2009 (UTC)
First of all, don't confuse guidelines with policy. In most cases, "died" is more appropriate than "passed away", I don't disagree with that. But I still disagree with including this entry here for two reasons: a) As I said above, this is the typo list, it contains terms that need to be fixed and are wrong 100% of the time. Which leads to b) "passed away" is usually not appropriate, but not always. Plot summaries come to mind, and of course quotes (or are we supposed to add a [sic] to someone being quoted as "He passed away", like we do with all typos?). Adding this term to
WP:FRONDS instead, which people can use to hunt for badly phrased sentences, sounds much better to me. --
Conti|
✉21:14, 13 August 2009 (UTC)
That's why editors need to preview EVERY tool edit before they make them, because there are times that the suggested edit will be wrong. Also, I agree with Tagishsimon. The discussion began, I left my office to commute home, ate a slice of pizza, turned on my computer, and the discussion was over and the change was made. I think it needs to be discussed more.
-shirulashem(talk)00:54, 14 August 2009 (UTC)
My concern is that we're moving from typos to enforcing stylistic changes. Perhaps a different checkbox should be created for this, so editors don't blindly approve the fixes (even though they aren't supposed to). There's a reason the "phrases" section was, until "passed away", empty. It's a bit of a different bird. I'm not particularly fussed though, so if you want to put it back in while more people weigh in, I won't consider it to be edit warring or anything. –
xenotalk00:57, 14 August 2009 (UTC)
(EC with Xeno) And, Conti, you have yet to demolish a couple of arguments: 1) AWB/Typos has for a long, long time had a section for "incorrect phrases", which seems to indicate an intention to deal with incorrect phrases. According to the guidelines, passed away et al are incorrect phrases. 2) Your own typo and [sic] argument reveals, per Shirulashem, that there are instances where 100% turns into slightly less than 100%; you're probably as likely to get a false positive with a conventional typo regex as you are with this phrase regex. I do take Xeno's point that phrases are a different kind of bird, but am concerned that
WP:FRONDS is too immature to be considered a solution. Like Xeno, I'm happy that we keep passed away removed while we discuss; the discussion is more important than whether passed away happens to be in or out as we discuss. --
Tagishsimon(talk)01:07, 14 August 2009 (UTC)
1) Yes, and as far as I can see, it has been empty
from the day it's been added. Regardless of whether there ever was an intention to use this page to fix incorrect phrases, I simply disagree with the use of this page for that purpose. 2) I disagree here, too. Just do a search for
"passed away" and see how many false positives you can find. There are a lot more than you will find when searching for actual typos. --
Conti|
✉08:52, 14 August 2009 (UTC)
There's a tab in AWB that will show you which rule matched. In this case, I'm betting it was a suffix rule replacing -aly with -ally, which indeed did the expected thing here. Are you suggesting the addition of a new fix to apply to "ocaission"? --
JHunterJ (
talk)
23:18, 18 August 2009 (UTC)
One of the project "to-dos" is to remove rare words. The only instance of "ocaissionaly" has been fixed. I'm postulating that the addition of a fix for it is not necessary. --
JHunterJ (
talk)
21:12, 20 August 2009 (UTC)
Search method
How does AWB search for typos? Does it search the wikisource of the page or actual text that we see on article tab? Thanks! —Preceding
unsigned comment added by
70.26.3.12 (
talk)
00:02, 25 August 2009 (UTC)
Some people keep changing the correct spelling of
liniment to "linament", using AWB in the article
Slough. As you can see there is even an article for it with the correct spelling.
Obviously this must be spelt incorrectly in the Browser, it wouldn't occur otherwise. Please, someone, change this.
Dieter Simon (
talk)
23:13, 3 September 2009 (UTC)
Those should probably be better exempted by wrapping them in {{
lang|es}} templates, since they aren't in English usage (they don't appear in the destination article, for example). --
JHunterJ (
talk)
18:20, 16 September 2009 (UTC)
Done with
this edit. To illustrate what I was saying before, I also blocked the possibility of AWB altering it on one page with
these edits. AWB won't "fix" foreign-language text that's identified as foreign language text by the use of the {{lang}} template. --
JHunterJ (
talk)
19:50, 19 September 2009 (UTC)
In
this edit, an AWB user replaced a "Petersberg" referring to
Petersberg, Hesse with "Petersburg", claiming it was a typo. I'm not sure this spelling should be included in the list; it might do more harm than good, considering that there are several plausible legitimate uses of
Petersberg. —
JAO •
T •
C09:18, 6 October 2009 (UTC)
Ta. I should also mention that advertizing is correcting to advertising. I'd remove the rule myself, but I'm not sure if it's just faulty. --
Closedmouth (
talk)
14:51, 21 October 2009 (UTC)
It took me a while to find the
external link to the syntax summary on the
AWB home page. For the benefit of those who know the RegExp principles, but are not acquainted with Microsoft's take on it, I suggest
clarifying which elements of the Microsoft mess should not be used, and which ones should be avoided, be it for performance reasons or compatibility or whatnot
making our own summaries for both the quick and dirty and for the advanced messers
Checking the list for expressions that match their output, i. e. matching "foo", then replacing it with "foo".
Overlapping search expressions, e. g. if someone added a rule "ibm" -> "ibn", this would clash with "ibm" -> "IBM".
Crawling redirects, disambiguation pages and AJAX suggestions (from search results) for useful information.
Utilities that convert between regexps and lists of match-replacement pairs, for not-too-complex rules this could save a lot of headaches for beginners, and time for advanced users, and these cases make up the vast majority of rules.
Writing the above it occurred to me that a simple wizard would probably be the simplest solution: You enter a match and/or replacement term, and the wizard shows you whether the matchword already has a replacement, or what words match to a given replacement term. Then, you get to choose the appropriate editing options. Forming efficient regexps can be left to the software. How does that sound?
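The two list checks suggested above (rules that reproduce their own input, and rules that clash on the same match) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rule tuples, the function names `no_op_rules` and `clashing_rules`, and the sample words are all made up for this example, and Python's `re` module stands in for the .NET regex library that AWB uses.

```python
import re

# Hypothetical rules as (name, find, replace) tuples, written in Python syntax.
RULES = [
    ("IBM", r"\bibm\b", "IBM"),
    ("ibn", r"\bibm\b", "ibn"),  # the clashing rule from the example above
    ("foo", r"\bfoo\b", "foo"),  # a no-op rule: its output equals its input
]


def no_op_rules(rules, samples):
    """Return names of rules that match a sample but leave it unchanged."""
    found = []
    for name, find, repl in rules:
        pattern = re.compile(find)
        if any(pattern.search(s) and pattern.sub(repl, s) == s for s in samples):
            found.append(name)
    return found


def clashing_rules(rules, samples):
    """Return pairs of rules that match the same sample but rewrite it differently."""
    clashes = []
    for i, (name_a, find_a, repl_a) in enumerate(rules):
        for name_b, find_b, repl_b in rules[i + 1:]:
            for s in samples:
                both_match = re.search(find_a, s) and re.search(find_b, s)
                if both_match and re.sub(find_a, repl_a, s) != re.sub(find_b, repl_b, s):
                    clashes.append((name_a, name_b))
                    break
    return clashes


samples = ["ibm", "foo"]
print(no_op_rules(RULES, samples))
print(clashing_rules(RULES, samples))
```

Both checks depend on the sample words supplied, so they can only flag problems that the samples happen to exercise; a wizard as proposed would presumably need a wordlist or a database dump to draw samples from.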
Being the one who erroneously edited several articles, replacing "an Eulerian" with "a Eulerian" using AWB, I support. --
bender235 (
talk)
11:27, 7 November 2009 (UTC)
someone should add "Mississauga", "Calgary", "New Brunswick", "Nova Scotia", "Prince Edward Island", and "Edmonton"
tablo (
talk)
22:20, 8 November 2009 (UTC)
Oh
I thought that any typo with examples more than two months old was probably not on the list. As a rule of thumb how old would you suggest examples need to be for a typo not to be on the list? ϢereSpielChequers13:33, 11 November 2009 (UTC)
There's no rule of thumb. Typos older than the last time an editor used AWB with RETF enabled are probably not on the list. The way to check to see if it's on the list is to point AWB at the page (with RETF enabled) and see if it catches it. Since AWB usage is human-initiated, not automatic, a page that is five years old but hasn't bubbled up to some AWB editor's list won't get corrected. --
JHunterJ (
talk)
13:49, 11 November 2009 (UTC)
False Positives in Sixteenth-Century Titles
Hi, is there a way to stop people using AWB to change sixteenth-century spellings in titles of sixteenth-century books into modern spelling? I keep reverting the corrections of agenst → against, breif → brief, mariage → marriage in the article on
George Joye but people using AWB keep changing it back without thinking or reading the text in context. A note in the Discussion page did not help.
GJ1535 (
talk)
09:41, 11 November 2009 (UTC)
I just scanned an article about Les Miserables, where every occurrence of "Rue Plumet" was suggested to be changed to "Rue Plummet". Opinions on the best way to handle this going forward? --
SarekOfVulcan (
talk)
19:14, 25 November 2009 (UTC)
produced a false positive, it tried to fix
"... the most dynamic, action-based of these ..." to "... the most dynamic, action-based on these ...". (from "
Bacone school).
I'm not sure how often "based of" is part of a "<foo> based of <bar>" construction and how often it should be changed to "based on".--
ospalh (
talk)
14:04, 7 December 2009 (UTC)
The replacement yields "iiscernible" as it now stands. Also the word "indiscernible" may exist as a stray on a line above.
LilHelpa (
talk)
01:58, 17 December 2009 (UTC)
There's a regexp tab in AWB that will tell you which patterns hit. (BTW, adding comments to talk pages isn't a minor edit.) --
JHunterJ (
talk)
02:11, 21 December 2009 (UTC)
" Womens' " gets changed to " women's' " instead of " women's ". (Spaces added so you could actually see what I was talking about.) --
Closedmouth (
talk)
14:24, 26 December 2009 (UTC)
Capitalization of titles in other languages
A
recent edit at
Nicole Oresme cleaned up a lot of things, but incorrectly changed the word latin to Latin in the title of the following book in French.
Wolowski, ed., Traictié de la première invention des monnoies de Nicole Oresme, textes français et latin d'après les manuscrits de la Bibliothèque Impériale, et Traité de la monnoie de Copernic, texte latin et traduction française (Paris, 1864)
French usage minimizes capitalization, and the lower cased latin was correct. Is there a way to make your capitalization changes language sensitive? Thanks. --
SteveMcCluskey (
talk)
22:01, 1 January 2010 (UTC)
I would have updated the article but you've now removed those references. The answer is to use the {{lang}} template to enclose the foreign-language text. e.g. Smith, F. {{lang|fr|Quelques mots en français}}.
Rjwilmsi22:35, 1 January 2010 (UTC)
Purportrated
The word "purpotrated" gets fixed to "purportrated" by
It should, of course, become "perpetrated". I don't know if it's worth making a new rule for this uncommon typo, but I do think the "Purport" rule should be fixed so it doesn't catch it.
MANdARAX • XAЯAbИAM23:11, 1 January 2010 (UTC)
I've disabled the fix for "à la" because there are lots of false positives (particularly on Spanish/Italian text). If we are to keep the rule we need to find a more restrictive version with many fewer false positives.
Rjwilmsi17:28, 3 January 2010 (UTC)
Doing a little back of the envelope testing with an English wordlist, I anticipate at least 50 words that if spelled incorrectly, could be transformed nonproductively (like your example), including things like:
desiccant
designed
designator
designer
desirable
The good news is they were spelled wrong before, so a new rule could anticipate those extra S's. I could propose one, but it would require more testing than I can do right now before going live.
I count only a few that should have one s and will be made incorrect, (disidentify, disimitate, disimitation), and about 30 words that the filter will correct (dissimilar, dissipated, dissipation).
Those somewhat strange rules in the middle serve to exclude about 100 or so correct words that would be changed. These include words like disinterested, disinfect, disincline.
One possible solution is to remove r from the range: \b(D|d)isi([a-ko-qs-z]|m[a-nq-z])(\w+)\b (the added part is in bold). This eliminates about half of the problem words, including your example, and only eliminates two of the legitimate corrections. This is at a cost of about 5% of the legitimate corrections.
Again, these are all estimates and don't take into account the frequency with which the words are used, which is a big factor.
Shadowjams (
talk)
06:27, 13 January 2010 (UTC)
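To make the effect of the narrowed range concrete, here is a minimal Python sketch that checks a handful of words against the modified pattern quoted above. Python's `re` module is used only for illustration (AWB uses the .NET regex library), and the sample words and the helper name `is_caught` are assumptions made up for this example, not taken from the wordlist test.

```python
import re

# Modified pattern from the comment above, with "r" removed from the range.
DISI = re.compile(r"\b(D|d)isi([a-ko-qs-z]|m[a-nq-z])(\w+)\b")


def is_caught(word):
    """Return True if the pattern matches the whole word."""
    return DISI.fullmatch(word) is not None


for word in ["disipated", "disimilar", "disinfect", "disirable", "disigned"]:
    print(word, is_caught(word))
```

In this sketch "disipated" and "disimilar" are caught, "disinfect" and "disirable" are left alone, and "disigned" is still caught, which is the kind of remaining problem word the estimate above allows for.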
There already is a fix for "Design", and it should fix the misspelling above.
It is: <Typo word="Design" find="\b(D|d)[ei]s(?:sigi?n|gin|ing)(s?|ed|ers?|ing)\b" replace="$1esign$2" />
Interestingly enough though, it won't catch these misspellings: desigins, desiginer, desiging. I think that could be added ( add |igin after |ing ) without breaking anything, but I can't test it right now.
Shadowjams (
talk)
07:21, 13 January 2010 (UTC)
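Since the tweak above was posted untested, a quick Python sketch comparing the current "Design" rule with the proposed one may help. The .NET replacement `$1esign$2` is written as `\1esign\2` for Python's `re` module; the names `CURRENT`, `PROPOSED` and `fix` are made up for this example, and Python is only a stand-in for the .NET engine AWB uses.

```python
import re

# Current rule as quoted above.
CURRENT = re.compile(r"\b(D|d)[ei]s(?:sigi?n|gin|ing)(s?|ed|ers?|ing)\b")
# Proposed tweak: "|igin" added after "|ing".
PROPOSED = re.compile(r"\b(D|d)[ei]s(?:sigi?n|gin|ing|igin)(s?|ed|ers?|ing)\b")
REPLACEMENT = r"\1esign\2"


def fix(pattern, text):
    """Apply one version of the rule to the text."""
    return pattern.sub(REPLACEMENT, text)


for word in ["desgin", "dessigned", "desigins", "desiginer", "desiging", "designing"]:
    print(word, "->", fix(CURRENT, word), "|", fix(PROPOSED, word))
```

Under these assumptions the proposed version picks up "desigins" and "desiginer" and leaves the correct "designing" alone, but "desiging" is still not caught: after "desigin" only "g" remains, which none of the suffix alternatives accept.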
Please
check usage beforehand. The variant "Plattform" has numerous legitimate uses, among them
PlattForm Advertising. "plataform" looks good, only exception seems to be
PLATAFORM BL
AWB permanently tries to correct spellings in URLs, like "www.xyz.com/india" -> "www.xyz.com/India". Can this be prevented? --
bender235 (
talk)
19:50, 16 January 2010 (UTC)
E.g.
Concepcion Quetzaltepeque El Salvador. AWB tried to correct "http://www.lonelyplanet.com/worldguide/destinations/central-america/el-salvador/essential?a=culture" to "http://www.lonelyplanet.com/worldguide/destinations/central-America/el-salvador/essential?a=culture" --
bender235 (
talk)
22:36, 16 January 2010 (UTC)
Hmm, if that page were reformatted to use external links or citation templates the typo fixing would know to leave the URLs alone.
Rjwilmsi23:10, 16 January 2010 (UTC)
Yes, but
WP:Year doesn't mention succession boxes. Additionally sometimes full dates are used in succession boxes (for example in articles about music albums or about boxes), which wouldn't come under
WP:Year, but perhaps rather under
MOS:DOB. Finally
MOS:DASH allows exceptions in lists, so why shouldn't it also apply to succession boxes? (Yes, I know that succession boxes are an odd version of a list.) If I have not convinced you, then please consider this as settled. Best wishes
~~ Phoe talk ~~ 20:48, 20 January 2010 (UTC)
Yes, date ranges are different to year ranges.
WP:DASH perhaps has a clearer explanation of why. AWB is not removing spaces in date ranges, only year ranges. The
WP:DASH exception for lists is the extra use of en dashes, not an exception to allow year ranges to be spaced.
Rjwilmsi21:16, 20 January 2010 (UTC)
That expression works, but it's not a common typo. Scanning the November database dump I only find that misspelling used in 4 articles, 2 of which are intentional, and 2 of which I corrected. The two misspellings were added by one editor. I'm going to hold off adding it.
Shadowjams (
talk)
01:55, 25 January 2010 (UTC)
Do you have any indication that there is another rule that generally handles this, but didn't in this case? I can't find (in a very quick search, admittedly) a rule that would have matched this. I'll work on a new one, but if there's an old one that should have gotten it, knowing that would be very helpful.
Shadowjams (
talk)
08:26, 29 January 2010 (UTC)
Ok, this should work. I don't want to add it in quite yet because I haven't tested it very much, but feel free to add it to your add/replaces, and if you don't see any problems then go ahead and add it to the typo list.
I'm not 100% sure that "intitled" is a typo; the dictionary references I looked up were a little unclear. But I don't think it's a problem edit either. In most cases "entitled" is going to be more right than "intitled", although I wonder if there are cases where "intitled" is correct. I'm not sure.
The other downside, I can't offhand think of a way to keep the case correct while transforming letters, so you'll need two rules, one for "Intitled" and one for "intitled". Just change the first letters, respectively. This one should also catch a simple transposition or deletion in the middle (the most likely typo).
I'm finding a lot of English language quotes, particularly in legal opinions, from the 1800s and before use "intitled". Perhaps we need to make sure any edit doesn't change a quote.
Shadowjams (
talk)
08:59, 29 January 2010 (UTC)
There are ways to exclude quoted statements like this, but all of them that I'm coming up with right now are pretty processor intensive. There might be a way to creatively limit this, at some expense of type 2 errors, that is less processor intensive. I might revisit it at another time. I would recommend against using the above regex unless you're extremely careful you're not changing a quote.
Shadowjams (
talk)
09:10, 29 January 2010 (UTC)
Paradoctor - That is what I found, more or less as well. I don't think there's a problem converting modern text, but we certainly don't want to alter any quotes that use it. Because AWB uses the .net regex library there are some non-greedy expressions that aren't possible in most other regexes that might fix this nicely... but I'm concerned that most solutions will eat a lot of processing power. If some others have ideas I'd like any advice.
Shadowjams (
talk)
09:14, 29 January 2010 (UTC)
AWB does not apply the typo fixing rules within templates e.g. {{cquote}} or within quote marks e.g. " and all the common variations.
Rjwilmsi09:38, 29 January 2010 (UTC)
I'm not sure I understand your question. I'll explain my answer again in more detail in the hope it does answer your question: when AWB executes a typo rule from the
WP:AWB/T list it first hides the quotes then applies the typo regexes, then unhides the quotes again. If you apply the regex by other means you will not get this quote hiding (unless you write a custom module to access the functions).
Rjwilmsi09:55, 29 January 2010 (UTC)
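The hide, apply, unhide sequence described above can be sketched roughly as follows. This is not AWB's actual code: the two "intitled" rules, the placeholder scheme, and the function name `fix_typos` are all assumptions made for illustration, and only straight double quotes are handled here, whereas AWB is described above as covering the common variations and templates as well.

```python
import re

# Hypothetical rules, one per case, as suggested earlier in this thread.
RULES = [
    (re.compile(r"\bintitled\b"), "entitled"),
    (re.compile(r"\bIntitled\b"), "Entitled"),
]

# Straight double-quoted spans on a single line only.
QUOTED = re.compile(r'"[^"\n]*"')


def fix_typos(text):
    hidden = []

    def hide(match):
        # Remember the quoted span and leave a numbered placeholder behind.
        hidden.append(match.group(0))
        return f"\x00{len(hidden) - 1}\x00"

    # 1. Hide quoted spans behind placeholders.
    masked = QUOTED.sub(hide, text)
    # 2. Apply the typo rules to what is left.
    for pattern, replacement in RULES:
        masked = pattern.sub(replacement, masked)
    # 3. Put the quoted spans back unchanged.
    return re.sub(r"\x00(\d+)\x00", lambda m: hidden[int(m.group(1))], masked)


print(fix_typos('An act intitled "An Act intitled Foo" was passed.'))
```

The point of the design is that the rules themselves stay simple: they never need to know about quotes, because the quoted text is simply not there while they run.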
Sorry for the confusion. That wasn't very clear. You understood what I meant though. I believe, in that case, that the above should fix what the OP was talking about. Of course, the question of whether or not the i version is appropriate in the modern context is still open, although I would assume not especially controversial.
Shadowjams (
talk)
10:14, 29 January 2010 (UTC)
E.g.
The rule for “e.g.” (currently fourth among new additions) adds left bracket, for example “eg.” → “(e.g.”. This should be fixed by removing the bracket.
Svick (
talk)
04:08, 30 January 2010 (UTC)
I originally put it there, and then its structure was changed, and then
User:Marek69 disabled it, then made some changes and re-enabled it. The original one had a leading ( because the overwhelming majority of examples I found were at the beginning of parentheticals, which makes sense when you consider how people use the abbreviation. It is probably adding it because it was removed by Marek without changing the corresponding output.
I had tested the first version and was reasonably confident it didn't have many (I never found any) false positives. I cannot say the same about this new version. I am going to revert it back to the earlier version with a note. If someone wants to test it and change it that's fine too, but I think we're seeing some problems with it right now.
Shadowjams (
talk)
22:38, 30 January 2010 (UTC)
Another small question. Is E.g. ever proper in the Manual of style? (compared to e.g.). I don't know the answer, but wanted to bring it up.
Shadowjams (
talk)
22:41, 30 January 2010 (UTC)
The last version didn't work again (changed “eg.” to “(e.g.”, but didn't change “(eg.”), so I disabled it. Before it is turned on again, please make sure it works as it should.
Svick (
talk)
23:36, 30 January 2010 (UTC)
Looks fixed now. My mistake for not noticing that Marek's change was correct; the simplification is where it caused the problem.
That appears to be a result of the "-ining" regex, which is (?!\b(?:(?:Br|Kl|M|H|St)e|Nar|Kurt|Lap)inig\b)\b(\w+)inig(s|ly)?\b. I don't see any systematic way to fix this class of typos without interfering with the others. In other words, "inig" that should be "ing" are virtually indistinguishable from "inig" that should be "ining". If someone has some way to distinguish the two that would be useful, but I can't think of one right now.
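To illustrate how the leading negative look-ahead in that pattern protects proper names, here is a small Python sketch. The rule's replacement string is not quoted above, so the sketch only reports the stem captured by group 1 rather than guessing at the output; the sample words and the helper name `stem_if_caught` are made up for this example.

```python
import re

# The "-ining" pattern as quoted above.
INING = re.compile(
    r"(?!\b(?:(?:Br|Kl|M|H|St)e|Nar|Kurt|Lap)inig\b)\b(\w+)inig(s|ly)?\b"
)


def stem_if_caught(word):
    """Return the captured stem if the whole word matches, else None."""
    match = INING.fullmatch(word)
    return match.group(1) if match else None


for word in ["remainig", "combinig", "Steinig", "Meinig", "Kurtinig"]:
    print(word, stem_if_caught(word))
```

Words such as "remainig" and "combinig" are caught with the stems "rema" and "comb", while the listed names are skipped. Nothing in the pattern can tell whether a caught word was meant to end in "ing" or "ining", which is the ambiguity described above.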
Using the "-escent" rule, AWB changes "floresent" to "florescent". Although that is a valid word, the more likely intended word is "fluorescent". A wiki search for "fluorescent" produced 1042 articles, and "florescent" found 32 pages. For those 32, I fixed the incorrect usages, discovering that all except 3 were actually intended to be "fluorescent".
MANdARAX • XAЯAbИAM21:31, 9 February 2010 (UTC)
I've expanded the "Fluoresce" rule and removed "|[Ff]lu?or" from the "-escent" rule. I excluded "florescent" and "florescence" from "fixing" as they are correctly spelled words; however, as noted above, they're extremely rare on Wikipedia and the "fluo..." word is almost always the intended one, so if anyone thinks it's better without the exclusion, feel free to remove it.
MANdARAX•XAЯAbИAM04:09, 21 February 2010 (UTC)
New or Fix existing typos
I have come across a couple typos that are either not working or need to be added. Below are a few that I have found that either need to be added or don't seem to be working.
occassion to occasion. This exists in the typo list but doesn't seem to work all the time.
Philidelphia to Philadelphia. This exists in the typo list but doesn't seem to work all the time.
The rule for "Occasion" seemed correct for the case you cite, but I expanded it a little anyway to catch more misspellings.--
BillFlis (
talk)
20:51, 9 March 2010 (UTC)
Looks like Distict is changed to Distinct. ("<Typo word="Distinct_" find="\b(D|d)is(?:ctinc|tic|inc|t[ai]n(?=ti))t(i(ve|on|vely)|ly)?\b" replace="$1istinct$2" />
") But it might as well be a typo for District. (Especially if capitalized).--
ospalh (
talk)
09:58, 18 March 2010 (UTC)
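The behaviour reported above can be reproduced with a short Python sketch of the quoted "Distinct_" rule. The .NET replacement `$1istinct$2` is written as `\1istinct\2`; in Python 3.5 and later an unmatched optional group is substituted as an empty string, which is the behaviour this sketch relies on. The sample sentences and the function name `fix` are made up for illustration.

```python
import re

# The "Distinct_" rule as quoted above.
DISTINCT = re.compile(
    r"\b(D|d)is(?:ctinc|tic|inc|t[ai]n(?=ti))t(i(ve|on|vely)|ly)?\b"
)
REPLACEMENT = r"\1istinct\2"


def fix(text):
    return DISTINCT.sub(REPLACEMENT, text)


print(fix("a distict sound"))
print(fix("distictly audible"))
print(fix("the Distict of Columbia"))
```

The last sample shows the concern raised above: a capitalized "Distict" that was probably meant as "District" is turned into "Distinct".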
Why is "occasionanlly" corrected to "occasionnally", from an incorrect spelling to another incorrect spelling? I know that there is a rule for -anlly -> -nally, but it shouldn't apply to that case.
PleaseStand(talk)02:40, 25 March 2010 (UTC)
While AWB caught that 'their' should've been 'there', it missed 'on bored'. Mephistophelian
†14:52, 26 March 2010 (UTC)
This doesn't appear to be a very common misspelling:
[36] and there are also appropriate uses of the words "on bored" together, such as: "...blames anti-social behaviour in her area on bored News Night presenters...". –
xenotalk14:55, 26 March 2010 (UTC)
Added "Establishment". "Amry" seems to be a proper name, as are definitely "Thier", "
Thiers", and "
Etal", so need caution. I didn't find any occurrences of "aviaror" in wikipedia.--
BillFlis (
talk)
18:56, 26 March 2010 (UTC)
AWB fixed "acidentaly" with "acidentally"
here, but it should've been "accidentally" of course (I later fixed it manually in the article). --
bender235 (
talk)
22:56, 27 March 2010 (UTC)
I would like to suggest we put this in a <syntaxhighlight lang="xml">, add an <?xml> note, create a simple inline DTD, and encase the typos in a <typos> tag. This would allow easier use by common XML parsers like expat and DOM.--
Ipatrol (
talk)
03:03, 26 March 2010 (UTC)
Each section is already so encased; it doesn't show up on the page, you have to edit it to see it. What are the advantages of doing what you're suggesting? Why, for example, would anyone ever want to use an XML parser (whatever that is!)?--
BillFlis (
talk)
10:28, 26 March 2010 (UTC)
It's a programming technique that makes it easier for applications like AWB to pull the information in. It could potentially be used by applications outside WP. For example, if you created a program that checked Word documents for typos, you could use this list rather than create your own. --
Kumioko (
talk)
11:11, 26 March 2010 (UTC)
That would require rewriting whatever AWB uses to parse this input, which I assume is a very slight modification of some Microsoft .Net format (or whatever this thing's written in). The Microsoft regex library is actually pretty advanced with its look backs and look forwards, etc. Anyway, wouldn't it be simpler (or less disruptive maybe) to write a parser to translate the awb page into the XML format?
Actually, if I get really bored that's something I might be interested in doing. What exactly is the target format (what xml ref specifically)?
Shadowjams (
talk)
05:21, 8 April 2010 (UTC)
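As a rough idea of what reading the list with a standard XML parser could look like, here is a Python sketch using `xml.etree.ElementTree`. The `<typos>` root element follows the suggestion at the top of this thread and is an assumption, not the page's current layout; the two rules are copied from elsewhere on this page, and the function name `load_rules` is made up for the example.

```python
import xml.etree.ElementTree as ET

# Two rules copied from this page, wrapped in a hypothetical <typos> root.
DOCUMENT = r"""<typos>
<Typo word="Switch" find="\b(S|s)wti?ch\b" replace="$1witch" />
<Typo word="Design" find="\b(D|d)[ei]s(?:sigi?n|gin|ing)(s?|ed|ers?|ing)\b" replace="$1esign$2" />
</typos>"""


def load_rules(xml_text):
    """Return (word, find, replace) tuples for every <Typo> element."""
    root = ET.fromstring(xml_text)
    return [
        (typo.get("word"), typo.get("find"), typo.get("replace"))
        for typo in root.iter("Typo")
    ]


for word, find, replace in load_rules(DOCUMENT):
    print(word, find, replace)
```

One caveat for any strict XML approach: a literal `<` is not allowed inside an attribute value, so a rule using a look-behind such as `(?<=...)` would need that character escaped as `&lt;` before a conforming parser would accept it.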
Transsexual vs. Transexual Menace
While granting that "transsexual" is the more common spelling (albeit not a politically neutral one), people using this bot keep changing the name of the "Transexual Menace", which is definitely spelled with one s. Can somebody with a better grasp of the flavor of regex used here put in an exception for that organization?
Shmuel (
talk)
20:26, 29 March 2010 (UTC)
As you can see by looking at this one it is easy to miss with the naked eye and there are quite a few hits for it. Many if you include WP: and User: etc. It doesn't seem to be covered on the list (I had a go at writing the line but definitely didn't have it right sorry). There is also a section above which has the request for switchs -> switches. Many thanks ~
R.
T.
G21:31, 1 April 2010 (UTC)
'Swtich' seems like a genuine typo, although I can also imagine cases where the user mistyped the 'w' and the result should have been 'stich'. 'Switchs' could be a typo of "switch's" or 'switches'. Mephistophelian
†21:48, 1 April 2010 (UTC)
That would mean "switch ownings" possessive or "switch has", "switch is" etc., wouldn't it? The page
switch has hundreds of the word but none of "switchs" or "switch's". The same goes for
w:switch with 25 matches for "switch" but none for the others. ~
R.
T.
G01:02, 2 April 2010 (UTC)
Actually, searching for "switch's" gives 53 hits for it in the possessive: "The switch's place...", "The switch's number..." etc. ~
R.
T.
G01:05, 2 April 2010 (UTC)
I may have missed some of the above, but this regex should work:
I haven't tested it yet, I may later, but if someone does before me and it performs well, please put it on the list.
Shadowjams (
talk)
01:50, 2 April 2010 (UTC)
The rule was <Typo word="Switch" find="\b(S|s)wti?ch\b" replace="$1witch" />, and it's been added and expanded. Mephistophelian
†04:36, 2 April 2010 (UTC)
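For anyone wanting to try the quoted rule outside AWB, here is a minimal Python sketch of it. The .NET replacement `$1witch` becomes `\1witch` in Python's `re` syntax; the expanded version mentioned above is not quoted on this page, so only the original form is shown, and the function name `fix` and the sample words are made up for this example.

```python
import re

# The "Switch" rule as quoted above, before it was expanded.
SWITCH = re.compile(r"\b(S|s)wti?ch\b")


def fix(text):
    return SWITCH.sub(r"\1witch", text)


for word in ["Swtich", "swtch", "switch", "switchs", "swtiches"]:
    print(word, "->", fix(word))
```

In this form the rule fixes "Swtich" and "swtch", leaves the correct "switch" alone, and does not touch "switchs" or "swtiches", because the trailing `\b` requires the word to end right after "ch".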
It's testing well. I had the extra t? in there to catch for another typo, but I didn't see any examples of it when testing, so I doubt it's a common typo. I also didn't know you could use the (|x|y|z) construction to work the same as (x|y|z)?. I don't think there's a meaningful difference between the two, but interesting to note.
Shadowjams (
talk)
05:12, 2 April 2010 (UTC)
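The comparison between `(|x|y|z)` and `(x|y|z)?` can be checked with a small Python sketch. The pattern below is a made-up example in the spirit of the expanded rule, not the rule itself, and Python's `re` module is used as a stand-in for the .NET engine.

```python
import re

# Same suffixes, written two ways: an empty first alternative vs. an optional group.
empty_first = re.compile(r"\bswitch(|es|ed|ing)\b")
optional = re.compile(r"\bswitch(es|ed|ing)?\b")

for word in ["switch", "switches", "switching", "switchs"]:
    a = empty_first.fullmatch(word)
    b = optional.fullmatch(word)
    print(word, bool(a), bool(b), a.group(1) if a else None, b.group(1) if b else None)
```

Both forms accept and reject the same words here. The visible difference is in the capture: for plain "switch" the empty-alternative form captures an empty string, while the optional form leaves the group unmatched (`None` in Python). When the group is only used in a replacement, both engines substitute nothing in either case, so the result is the same. The two forms also try their options in a different order (empty first versus longest first), which could matter in a pattern without a trailing `\b` to force the choice.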
It took me a minute to realize what you were saying... yeah, those are unlikely words, but I doubt they'd be false positives either. On a scan of the last database dump I only found 9 instances of "switches" being misspelled in the same way. We do have a somewhat reflexive tendency to reform the regex rules, but I don't think there's any harm in this instance.
Shadowjams (
talk)
08:28, 2 April 2010 (UTC)
restricitve -->restrictive
It looks like AWB misses this incorrect spelling "restricitve" for the word "restrictive". Here is the diff for this one:
[37].
And it appears that AWB consistently misses misspellings of the word "metamaterial". Here is one example:
[38], although it is the plural of the word. I think the singular comes out the same. As you can see "Metamaterails" is this particular misspelling. Another misspelling that I myself do is "Metamterials" (I leave out an "a" from time to time). If you can add these, it would be much appreciated.
Steve Quinn (formerly Ti-30X) (
talk)
02:11, 9 April 2010 (UTC)
This is a suggestion, perhaps the term mileage should be replaced with fuel economy; since it is technically more accurate and can resolve English variations. --
JovianEye (
talk)
18:32, 19 April 2010 (UTC)
I don't think changing English variations is a good fit for AWB/T. Mileage also has other uses besides the fuel economy one, which would be hard to detect with a regular expression. --
JHunterJ (
talk)
18:49, 19 April 2010 (UTC)
Agreed. This is a good sort of thing to do with the find replace portion in AWB, because then the user's aware of the peculiarities of the expression. I have a few of these listed that I use
here. They can work well, but they occasionally require a human touch.
In fact, just searching for the word mileage, I didn't see a single example on the first page where that change would be correct, or beneficial. In most vernaculars too, even when the change would be ok, it makes the sentence more awkward. I'd be cautious before changing too many of these.
Shadowjams (
talk)
19:37, 19 April 2010 (UTC)
As they say,
your fuel economy may vary. But I can see myself changing some instances of "better mileage" with "greater fuel economy", although only by hand. In the U.S., the government-recognized measure of a car's fuel economy is in miles per gallon (mpg), hence the prevalence of "mileage" in this sense.--
BillFlis (
talk)
22:19, 19 April 2010 (UTC)
I'm not entirely sure about this one. On the one hand it is correct to add dot to show abbreviation on the other hand most of the time the dot becomes a full stop in a sentence, which is also not useful. Regards,
SunCreator(
talk)16:43, 27 April 2010 (UTC)
I'm not sure I follow. The rule right now won't add an additional dot if there already is one, and there's no difference typographically between a period and a full stop. It also fixes "ect" typos.
Shadowjams (
talk)
18:58, 27 April 2010 (UTC)
The rule adds a dot if there isn't one. One sentence looks like it is two as 'etc.' can look like the sentence ended. Regards,
SunCreator(
talk)19:27, 27 April 2010 (UTC)
e.g. "Table showing key dates, mileages, running numbers, etc for all class members." -> "Table showing key dates, mileages, running numbers, etc. for all class members." Regards,
SunCreator(
talk)19:35, 27 April 2010 (UTC)
Ah, I see. Well, that's its expected behavior. I pulled the rule from
Wikipedia:Manual of Style (abbreviations) in the table. Actually I've done most of those rules there that are amenable to being fixed without a lot of issues. For example, I haven't done this with "Ltd" or "St" or "Rd" because in Commonwealth spellings the period is left off... but I have done it in Latin abbreviations because I think that (maybe I'm wrong) the punctuation is appropriate in those cases in all English traditions.
A side note, I fix "a.k.a." myself, but I haven't put it into the typo rules because it's used in templates (which typo might not mess with) as well as it comes up occasionally in legitimate areas, so typo-fixing is too blunt an instrument for those cases. I haven't seen any trouble with etc. like that, but I'll see what others have to say too.
Shadowjams (
talk)
19:41, 27 April 2010 (UTC)
Can you create a list of false positives with AWB? I don't mean log the article but the actual spelling that you are double clicking to skip? Regards,
SunCreator(
talk)15:00, 1 May 2010 (UTC)
FWIW, I did some research and at one point "hone in on" was considered incorrect; but language being socially-constructed, it appears to be
taking hold. –
xenotalk22:17, 4 May 2010 (UTC)
From
Jagdish Lal Raj Soni "He understook a course in Design from London University in 1968."
A false positive to change Understook to Understood but instead should be "He undertook a course" etc. Regards,
SunCreator(
talk)16:23, 5 May 2010 (UTC)
This one is hard. The underlying rule is <Typo word="(Mis)Understood" find="\b(U|u|[Mm]isu)nderstoo[^d]\b" replace="$1nderstood" /> which will correct any "understoo" phrase whose last letter is not a d. I suppose we could change it to <Typo word="(Mis)Understood" find="\b(U|u|[Mm]isu)nderstoo[^dk]\b" replace="$1nderstood" /> and then create a new rule of find="\b(U|u|[Mm]isu)nd[ea]rs(took(?:en)?|taken?)\b" replace="$1nder$2", but really all that rule fixes is typos with an extra s. I don't know how common those are. (Also, that rule needs to be tested before it's inserted.)
Shadowjams (
talk)
23:53, 5 May 2010 (UTC)
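Since the proposed pair of rules was posted untested, here is a small Python sketch that runs the original rule and the proposed pair over the sentence from the report. The .NET replacements `$1nderstood` and `$1nder$2` are written as `\1nderstood` and `\1nder\2`; the names `ORIGINAL`, `PROPOSED` and `apply_rules` are made up for this example, and Python's `re` module is only a stand-in for the .NET engine.

```python
import re

# Original rule: any final character other than "d" is turned into "d".
ORIGINAL = [
    (re.compile(r"\b(U|u|[Mm]isu)nderstoo[^d]\b"), r"\1nderstood"),
]

# Proposed pair: exclude "k" from the first rule, then fix the stray "s".
PROPOSED = [
    (re.compile(r"\b(U|u|[Mm]isu)nderstoo[^dk]\b"), r"\1nderstood"),
    (re.compile(r"\b(U|u|[Mm]isu)nd[ea]rs(took(?:en)?|taken?)\b"), r"\1nder\2"),
]


def apply_rules(rules, text):
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


sentence = "He understook a course in Design."
print(apply_rules(ORIGINAL, sentence))
print(apply_rules(PROPOSED, sentence))
print(apply_rules(PROPOSED, "It was misunderstoof."))
```

With the proposed pair the sample sentence becomes "He undertook a course in Design.", while a misspelling such as "misunderstoof" is still corrected to "misunderstood". This only covers the handful of inputs shown; it says nothing about how common the extra-s typo is.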
About <Typo word="Saskat(chewa/oo)n" find="\bsaskat(chewa|oo)n\b" replace="Saskat$1n" />
: there is a berry called saskatoon, or
amelanchier alnifolia as well as the city of
Saskatoon, so maybe we shouldn't automatically capitalise it.--
ospalh (
talk)
14:29, 7 May 2010 (UTC)
I'll change it to preserve the case, because I don't think there's any good way to distinguish between the two cases.
Shadowjams (
talk)
23:31, 7 May 2010 (UTC)
Actually I disabled it because it didn't make any changes except to capitalize it. If there are common misspellings of it though, let me know what those are and I can try and work them into the rule and have it preserve case.
Shadowjams (
talk)
23:37, 7 May 2010 (UTC)
I'm not sure what you mean by "missed this one", but this is apparently a single instance of this error in all of wikipedia. Do you really think we need to establish a permanent rule to search for this weird error ("idmediatly") every time somebody runs AWB?--
BillFlis (
talk)
01:34, 10 May 2010 (UTC)
Yes, we want to avoid rare/implausible misspellings but few current matches doesn't necessarily mean there won't be some more on a regular basis in the future. If we can expand an existing rule then let's do it.
Rjwilmsi10:24, 10 May 2010 (UTC)
This should do it. I've tried to make it broader to justify its purpose more. Someone might test it first.
The spelling corrections are easy, so long as there aren't conflicts with other correctly spelled words. The spacing is an issue because periods and commas are used in other contexts where spacing wouldn't be appropriate. For example, in chemistry articles (off the top of my head). In addition, I worry about the extra processing required to have regex threads running on every punctuation.
Thanks. Is it possible to include a spelling and grammar checker in the edit window, similar to those in email systems or in MS Word? I am not sure if JavaScript can handle such a plug-in. But if this is possible, it would help editors a lot before saving articles. --
Thaejas (
talk)
00:51, 14 May 2010 (UTC)
That would be a developer issue. The typo engine runs (I think) as a plugin and so it can't affect the UI like that.
Shadowjams (
talk)
03:44, 14 May 2010 (UTC)
As for moving templates, there are lots of templates that belong on a specific part of a page, not at the end. Infoboxes, for example. In fact, anytime you see something in {{ something here }}, that's a transclusion, meaning it's copying content from another source to the current page. If there's no prefix to it, it's assumed to be a template (e.g., {{Template:Infobox person}} is the same as {{Infobox person}}). Also, moves like that are better handled through the AWB developers directly rather than through the typo rules, which are less powerful than what you can do with a full programming language.
Etc., i.e., and e.g., don't include trailing commas intentionally, because they can be used in cases where that is not advised, there might be other punctuation, at the end of a sentence, etc., although maybe the rule should ensure some sort of punctuation afterwards. Anyone else have ideas about that?
Finally, I'm not sure what you mean by the full-stop after reference tag. Reference tag punctuation is a common problem, but it's not quick to fix. I have an expression I use to fix it
here. You could use it in your own personal find-replace settings.
Shadowjams (
talk)
18:05, 13 May 2010 (UTC)
Thanks. Your fix is precisely what I am looking for. But can this fix be a part of the general typo fixes instead of being in editor's find-replace? --
Thaejas (
talk)
00:51, 14 May 2010 (UTC)
I would probably be ok with that, but others might not be. I've been using that fix for a long time and I have almost no problems with it. However, it doesn't handle indefinite number of reference tags, only a finite number, and each time you up that number it increases the processing workload. I don't know how it would affect the overall processing power, but I'd also like to hear others discuss that.
I'll do these in their own section so they're easier to follow. The "different" rule doesn't correct, or corrects as it does, because the 2nd group (the non capturing one) has a rule for "f" alone by itself, without the e, but most of all because it accepts any suffix to it. I think that's probably a poor design on the end, so I've changed it to require a trailing t, or some similar suffix. If this makes it miss a lot of stuff, let me know.
Shadowjams (
talk)
05:01, 14 May 2010 (UTC)
Actually I changed it by adding an extra optional e. This is easier and allows the expansive word capture on the end. I don't think that's good design generally, but it seems to work for now.
Shadowjams (
talk)
05:08, 14 May 2010 (UTC)
Manoeuver maneuver UK/US
Quoting user Mjroots on my talk: "Manoeuver is British English spelling, whereas maneuver is American English."
The "other then" -> "other than" rule produces some false positives in cases like "Other then-popular things [...]" or "Other then-known stuff [...]". I suggest someone should add an exception to that rule, saying that "if 'then' is followed by any letter, it should not be replaced with 'than'". --
bender235 (
talk)
15:46, 18 May 2010 (UTC)
I'm curious why AWB replaces "Catholic church" with "Catholic Church", and "Methodist Church" with "Methodist church". I think the spelling should be consistent (I'd prefer lower case, for that matter). --
bender235 (
talk)
15:46, 18 May 2010 (UTC)
But the
Catholic Church uses uppercase. The Catholic Church is the worldwide entity, each Methodist church is a building serving a local congregation, as I understand it. --
JHunterJ (
talk)
15:57, 18 May 2010 (UTC)
Other than
If you run RegExpTypoFix on
Ibn Battuta, you will see that "The other then sailed away without him" is changed to "The other thansailed away without him". This is a false positive, of course, and it would be too hard to put that right, but why did the rule drop the space after "then"? Should the "replace" string be "$1 than$2"?
<Typo word="More/Greater/Less/Rather/Other than" find="\b([Mm]ore|[Gg]reater|[Ll]ess|(?:[Rr]a|O|o)ther)\s+then(?:\s)" replace="$1 than" />
John of Reading (
talk)
20:24, 18 May 2010 (UTC)
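On the dropped space: the find pattern ends with (?:\s), which consumes the whitespace after "then" without capturing it, so the replacement "$1 than" has nothing to put back. Below is a minimal Python sketch of that behaviour and of one possible fix using a lookahead. Python's re module stands in for the .NET engine here, and the function names are made up for illustration.

```python
import re

# Rule as quoted above; the trailing (?:\s) consumes the space after "then".
CURRENT = re.compile(r"\b([Mm]ore|[Gg]reater|[Ll]ess|(?:[Rr]a|O|o)ther)\s+then(?:\s)")

# Possible fix: a lookahead checks for whitespace without consuming it.
LOOKAHEAD = re.compile(r"\b([Mm]ore|[Gg]reater|[Ll]ess|(?:[Rr]a|O|o)ther)\s+then(?=\s)")


def apply_current(text):
    """Apply the rule as it stands (loses the space after 'than')."""
    return CURRENT.sub(r"\1 than", text)


def apply_lookahead(text):
    """Apply the lookahead variant (keeps the following whitespace)."""
    return LOOKAHEAD.sub(r"\1 than", text)
```

The lookahead variant still produces the Ibn Battuta false positive; it only stops the rule from eating the space. It also leaves "Other then-popular" alone, because a hyphen is not whitespace.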
If the rule excluded capitalized versions and italicized versions, would that eliminate most of these false positives?
Shadowjams (
talk)
06:20, 23 May 2010 (UTC)
Ok, I'll test out a version like that. Exclusions are a reg-ex nightmare :P but let me see if I can make it work.
Shadowjams (
talk)
06:32, 23 May 2010 (UTC)
I'm having some trouble with this. For example, try \b(!:Fondation)(F|f)o(?:ud?n|nd)ation(s|al|ally|less)?\b and replace it with what's in the rule right now ($1oundation$2). If your input is "fondation" then it skips it, although I would think it shouldn't. Is it not case sensitive on the (!:.) capture group? Is this a .net thing or am I just making some mistake here? (by the way, this is a simplified version of a broader issue I'm having with the !: groups).
Shadowjams (
talk)
07:01, 23 May 2010 (UTC)
I think you mean (?!pattern), the zero-width negative lookahead assertion? But I prefer the negative look-behind at the end, as theoretically faster (so that it only looks behind if a potential match has been identified): (?<!pattern) --
JHunterJ (
talk)
12:36, 23 May 2010 (UTC)
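For comparison, here is a rough Python sketch of the two placements discussed for the "foundation" rule: a negative lookahead (?!...) in front of the word and a negative lookbehind (?<!...) after it. Python's re module is assumed to treat these constructs the way .NET does; the names AHEAD, BEHIND and fix are illustrative only. The lookahead is written as (?!Fondation\b) so that both variants exclude exactly the capitalised word "Fondation".

```python
import re

# Negative lookahead at the start: evaluated at every candidate position.
AHEAD = re.compile(r"\b(?!Fondation\b)(F|f)o(?:ud?n|nd)ation(s|al|ally|less)?\b")

# Negative lookbehind at the end: evaluated only after the word has matched.
BEHIND = re.compile(r"\b(F|f)o(?:ud?n|nd)ation(s|al|ally|less)?\b(?<!Fondation)")


def fix(pattern, text):
    """Rewrite matches as <F|f>oundation<suffix>, treating a missing suffix as empty."""
    return pattern.sub(
        lambda m: m.group(1) + "oundation" + (m.group(2) or ""), text
    )
```

Both variants should give the same results on ordinary text; the difference is only in when the exclusion is checked, which is the performance point raised above.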
etc => etc. is the most common correction by far, so it's worth going over.
/etc => /etc. ✗ Fail - the /etc is a common computer folder
etc) => etc.) ✓Pass - common and correct
etc(end of line character) => etc. ✓Pass again common
etc, => etc., ✓Pass common
etc; => etc.; ✓Pass
etc. => etc. i.e No change ✓Pass
etc any word => etc. any word ✓Pass✗ Fail technically correct, but I would prefer => etc., I amend most of these manually. The following space indicates it's used mid sentence and so etc., fits nicely.
I authored that rule; good catch on the unix filestructure ones -- that needs to be an exclusion. The last one, are you talking about the fact there's no trailing comma? If that's it, it's really hard to distinguish between when there should and shouldn't be one (I think) unless I'm misunderstanding that.
Shadowjams (
talk)
06:57, 23 May 2010 (UTC)
When etc is followed by a space (indicating a word is following), a comma is good. So 'etc ' => 'etc., ' but when it is followed by another character such as a closing bracket or a comma, or by the end of the line, then 'etc' => 'etc.' Regards,
SunCreator(
talk)07:03, 23 May 2010 (UTC)
Ok. I fixed the unix filestructure issue (I hope). I'll handle that other one tomorrow (if someone else doesn't first). Thanks
Shadowjams (
talk)
07:07, 23 May 2010 (UTC)
Sorry for the trouble on this one. Could someone help me with why this doesn't work: find="(?!/etc)(E|e)(tc\b([^\.\w])|ct\b\.?)" Replace="$1tc.$3" Shadowjams (
talk)
02:29, 24 May 2010 (UTC)
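One possible reading of why the (?!/etc) exclusion has no effect: a lookahead inspects the text from the current position forward, and at the point where the engine is about to match the "e" of "etc" the next characters are "etc", never "/etc", so the assertion always succeeds. Checking the character before the match needs a lookbehind instead. A simplified Python sketch follows (not the full rule; Python's re module is assumed to behave like .NET on these constructs, and the names are illustrative):

```python
import re

# Lookahead placed before the word: it can never see the "/" that precedes "etc".
AHEAD = re.compile(r"(?!/etc)\b(E|e)tc\b(?!\.)")

# Lookbehind: rejects the match when the preceding character is "/".
BEHIND = re.compile(r"(?<!/)\b(E|e)tc\b(?!\.)")


def add_period(pattern, text):
    """Append a full stop to a bare 'etc' matched by the given pattern."""
    return pattern.sub(r"\1tc.", text)
```

With the lookbehind, "/etc" paths are skipped while "etc)" and similar cases are still corrected.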
Okay, well quite a few will be in quotes, which makes it part of general fixes rather than typos anyhow. Regards,
SunCreator(
talk)15:38, 24 May 2010 (UTC)
Capitalisation of countries, religions etcetera in parts of words
Resolved
"the Slavic population was germanized by Germans" => "the Slavic population was Germanized by Germans"
"during the christianization of the eleventh century" => "during the Christianization of the eleventh century"
Capitalization like those seems a bit odd. I have been skipping them, but I'm unsure of what is correct. Regards,
SunCreator(
talk)11:29, 23 May 2010 (UTC)
the protein
NF-κB should not be written with
k, but rather the Greek letter
κ (kappa)
it should also be written exactly in that way: with 3 capital letters and the Greek one correct:
I think a false positive is when the typo list tries to fix something that's not broken. This is broken, but the fixer guessed wrong on the fix. The AWB user should catch that, but if not, the article is no worse off -- it's just as wrong as it was to start with. In this case, I'm not sure how to make it distinguish between those two possible fixes. --
JHunterJ (
talk)
19:10, 24 May 2010 (UTC)
It also wants to replace utilidoor with utiTCoor. There is nothing in the settings file and it wants to make the changes even if the typo fixing is off.
Enter CBW, waits for audience
applause, not a
sausage.
21:49, 24 May 2010 (UTC)
Never mind. I have no idea what caused that but deleting AWB and the settings that were stored in a separate folder followed by redownloading it has fixed the problem.
Enter CBW, waits for audience
applause, not a
sausage.
22:35, 24 May 2010 (UTC)
ie
Resolved
I've been getting a lot of false positives changing "ie" to "i.e." in hostnames with ie, the Irish top-level domain. For example, downloadmusic.ie in the article
2008 in Irish music. The current rule is:
Yeah, I'm aware of that problem. Most of those should be avoided if they're in a full url, but the ones that aren't in link templates won't be. It also shows up on a few other web addresses. One possibility is to add (?!\.ie\b) as an exclusion to the beginning (I've had a lot of trouble with those lately so I'll let someone else test that before adding it in).
Shadowjams (
talk)
10:33, 26 May 2010 (UTC)
I noticed there are quite a few to skip past. Why can't .ie be ignored? Surely it's enough to have a dot in front rather than checking for a complete url? Regards,
SunCreator(
talk)14:19, 26 May 2010 (UTC)
Well, you can use "French" but "french fries" ("sometimes capitalized") and "french-fried" don't need "correcting".--
BillFlis (
talk)
Good point. Should we exclude that one example, or is the rule generally problematic? I think we do the language capitalizations generally, notwithstanding other similar examples.
Shadowjams (
talk)
08:52, 30 May 2010 (UTC)
This rule would be the problem <Typo word="-ining" find="(?!\b(?:(?:Br|Kl|M|H|St)e|Nar|Kurt|Lap)inig\b)\b(\w+)inig(s|ly)?\b" replace="$1ining$2" /><!--Don't match (Br/Kl/M/H/St)einig, (Nar/Kurt/Lap)inig-->.
I'm honestly not sure exactly what that rule's fixing. Maybe someone can explain it, in which case I'd be more comfortable adding the exclusion for Closedmouth's example.
Shadowjams (
talk)
06:02, 9 June 2010 (UTC)
It fixes typos like "beginig". No harm to add a new rule for "-inninig" to "-inning" above this one.
Rjwilmsi09:39, 9 June 2010 (UTC)
I noticed today that there are many articles with the word "Olympic" or "Olympics" misspelled. Common misspellings are "Oylmpic", "Olmypic", and "Olypmic". Would a bot be able to fix these spellings, or am I in the wrong place? Thanks,
GaryColemanFanUser Talk:GaryColemanFan 9:05 pm, 27 May 2010, Thursday (19 days ago) (UTC−6) --
Cit helper (
talk)
06:04, 16 June 2010 (UTC)
I added a rule here. It corrects your suggestions "Olmypic" and "Olypmic", as well as "Olypic" and "Olymic" (and of course all their plurals), but I was not able to find any instances of "Oylmpic", so that's not included.--
BillFlis (
talk)
Someone please add a rule that replaces "Nurnberg" with either "Nürnberg" or "Nuremberg" (I suggest the latter would be more appropriate). --
bender235 (
talk)
21:21, 18 June 2010 (UTC)
In particular, I question whether "Medlys" and "Medlies" should be included here, in a separate rule ("Medly" seems quite common), or not at all.
PleaseStand(talk)23:56, 19 June 2010 (UTC)
I'm testing it now. It's mostly catching "attornies". Don't see any issues with it yet.
Shadowjams (
talk) 04:09, 20 June 2010
I would like to use this great plugin for my language, but when I try to enable the RegexTypoFix checkbox it says it will load the typos list from English Wikipedia. But I want to set it to download from my own language's Wikipedia. How can I do this? --
Mahir78 (
talk)
10:29, 22 June 2010 (UTC)
Add <!--Typos:http://en.wikipedia.org/?title=Wikipedia:AutoWikiBrowser/Typos&action=raw-->, replacing the en with whatever language you want, to the local checkpage. —Reedy10:40, 22 June 2010 (UTC)
Playright -> Playwright
There is a publishing house "Playright publishing". Is there a way to make sure the word is not replaced when it is either 1. capitalized or 2. followed by the word "publishing" ?--
Muhandes (
talk)
10:37, 23 June 2010 (UTC)
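Both conditions can be expressed in a regex. The Python sketch below is only an illustration of the idea, not the rule on the typo page: it matches the lowercase word only, and uses a negative lookahead to skip it when the next word is "publishing". The name fix_playright is hypothetical.

```python
import re

# Case-sensitive on purpose: capitalised "Playright" is left alone, and a
# lowercase match is skipped when the next word is "publishing".
PLAYRIGHT = re.compile(r"\bplayright(s?)\b(?!\s+[Pp]ublishing)")


def fix_playright(text):
    """Correct lowercase 'playright(s)' unless followed by 'publishing'."""
    return PLAYRIGHT.sub(r"playwright\1", text)
```

The trade-off is that a genuine typo at the start of a sentence ("Playright John Smith...") would no longer be caught.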
I'm not sure how to add this but it is very common. Currently it does childrens' → children's' which is incorrect.
If someone could add this it would be most helpful. --
Muhandes (
talk)
14:11, 24 June 2010 (UTC)
I just had it work correctly at least in one case. It might be that the times when it didn't work were due to ’ used instead of '? I will have to supply an example of a page not working correctly I guess. --
Muhandes (
talk)
14:43, 24 June 2010 (UTC)
We have childrens → children's and womens → women's, why not mens → men's ? If this is appropriate, can anyone add it please? --
Muhandes (
talk)
09:17, 25 June 2010 (UTC)
Please extend the -ely rule to catch that. I am also considering "falsly" and "sparsly" but am unsure whether it would be worth the processing time.
PleaseStand(talk)01:15, 26 June 2010 (UTC)
I'll check it out. I wouldn't worry about the processing time for those too much. Strangely though, that rule only finds those roots that have "in" or "un" at the front. I think that's unintentional... adding a ? to that first group would allow it to find all permutations. I'm testing that rule right now to see if there's some reason for it.
Shadowjams (
talk)
02:41, 26 June 2010 (UTC)
Is there a reason the search string ends with a space (\s) rather than a simple word boundary (\b)? The "then" (for "than") could be followed by a comma (perhaps separating a parenthetical phrase); e.g., "other than, say, sausages".--
BillFlis (
talk)
15:11, 28 June 2010 (UTC)
Because we want whitespace, not a word boundary, to avoid false positives when "then" is an adverb and not a misspelled preposition. For instance, since (back then) I thought that was the explanation, I didn't say any more then. --
JHunterJ (
talk)
15:23, 28 June 2010 (UTC)
AWB ignores typos in the version containing the link [[metropolitan bishop|metropoltian]], but it works with the version without that link.--
Diwas (
talk) 13:08, 29 June 2010 (UTC) (I had added the new typo rule yesterday.)--
Diwas (
talk)
13:11, 29 June 2010 (UTC)
The AWB Regex Tester is replacing [[metropolitan bishop|metropoltian]] with [[metropolitan bishop|metropolitan]]--
Diwas (
talk)
14:06, 29 June 2010 (UTC)
after edit conflict: Thank you for your answer. Now it works. Originally the link was correct, but I guess
this correction of my simple rule made it work. I guess if a rule matches a link, the rule will be ignored in this article. But my bad rule was matching the correct spelling. thanx --
Diwas (
talk)
12:51, 1 July 2010 (UTC)
separete/separeted
I notice there is separeble but not separete/separeted. It is quite a common typo. Would be nice if someone could add it. --
Muhandes (
talk)
15:25, 29 June 2010 (UTC)
It didn't until I modified it a couple of days ago. I should have commented here that it had been handled.--
BillFlis (
talk)
13:33, 1 July 2010 (UTC)
Request addition
Could someone please add:
ascession --> accession
It's in many, many "list of monarchs" type articles and it's a blatant misspelling; there are too many for me to fix them all manually. --
Ϫ22:12, 22 June 2010 (UTC)
I considered that and actually checked and pretty much all of them deal with accession. The search turns up nothing but lists of consorts etc. Besides, ascession is much more likely to be mistaken for accession because of the similar sound, and "ascension" isn't commonly misspelled. --
Ϫ05:34, 23 June 2010 (UTC)
I'm not in the process of testing it right now, but if you'd like to, try this: Find: "\b(A|a)sc+es+[io]{2}n\b" Replace: "$1ccession". The extra stuff in the middle should catch the "io" "oi" switch, and I'd guess that ascension misspellings will probably include an "n" somewhere, which would exempt it from that regex.
Shadowjams (
talk)
05:58, 25 June 2010 (UTC)
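For anyone who wants to try the proposed pattern without AWB, here is a small Python sketch of it. Python's re module is assumed to handle this pattern the same way as .NET, and fix_ascession is a made-up name.

```python
import re

# Pattern as proposed above: "sc+", "es+", then any two of i/o, then "n".
ASCESSION = re.compile(r"\b(A|a)sc+es+[io]{2}n\b")


def fix_ascession(text):
    """Replace 'ascession' and close variants with 'accession', keeping case."""
    return ASCESSION.sub(r"\1ccession", text)
```

As expected, "ascension" is not touched, because the "n" before "sion" breaks the "es+" part of the pattern. Note that [io]{2} also accepts "ii" and "oo", which is broader than the "io"/"oi" swap described.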
Oh you mean for me to test it? No I don't normally use AWB, I don't have it installed.
Hah. I'm sorry; I'm not avoiding it, I'm not sure if anyone else is, but I wouldn't see a reason why if that were so. I don't have a wiki dump handy right now which is why I can't test it immediately [I did earlier but I forgot about this one]. I'll try and take a look soon. I don't foresee any issues with what I proposed above, but I get a little cautious around these British monarch-related changes because they're used in all sorts of ways that I can't begin to comprehend, so I like to test those. I am pretty cautious though; it's not a catastrophic event if they're added and then later tweaked.
Shadowjams (
talk)
09:00, 27 June 2010 (UTC)
It just seemed like some are hesitant about adding it. So if some readers here need some reassurance, I did my homework on this. The
search for "ascession" gives only 47 results, while
"accession" gives 11,334 results! The search for "ascession" turns up almost all "List of ____ consorts" type articles. In all of these articles the word is used in the context of the definition of "
accession", not "ascension", or anything else. These articles all have similar tables in which this word appears multiple times, so I'm thinking the same person created all these tables and used the same misspelled word in all of them, not knowing that "ascession" isn't even a word! I checked in multiple dictionaries and even asked the gurus over at
Wiktionary's Tea room. So I'm quite certain it's safe to add this to the list! :)
Ϫ08:26, 28 June 2010 (UTC)
Wow. Sorry that things here weren't happening fast enough to please you. We hate to see you go, really, because we are entirely at your service, and your complete satisfaction is our only goal. The thing is, some of us are
Old Farts, who check our email only about every couple of hours. Even then, we tend to think a bit before we act. Oh, and you forgot to take a number, so we didn't even see you there at the end of the queue.--
BillFlis (
talk)
00:17, 1 July 2010 (UTC)
I did put in a rule that you could try out. Presumably you use AWB, so you could plug it in and try a few. I haven't gotten around to doing that. It's nothing personal. I think that rule will work without any problems and someone can axe it if it starts acting up.
Shadowjams (
talk)
09:01, 1 July 2010 (UTC)
Let's bury the hatchet you two. My regex from above will probably blank the main page. Actually... that'd be much more impressive than anything I've actually contributed. Let's hope for disaster.
Shadowjams (
talk)
10:47, 2 July 2010 (UTC)
May someone please add this to the misspelling list, to be replaced with "practice"? I'm a bit intimidated by the code. :)
Search results bring up quite a few occurrences that are tedious to be fixed manually. Thanks,
Airplaneman ✈06:19, 5 July 2010 (UTC)
...or maybe not. "Proactive" could be a possibility as well. I'll go through the search and manually fix them :)
Airplaneman ✈06:26, 5 July 2010 (UTC)
I think "non-metropolitan" should be replaced by "rural." And "semi-metropolitan" by "small-town". What do you think?--
BillFlis (
talk)
22:01, 5 July 2010 (UTC)
I am not sure, as I am not a native English speaker, but the word rural entered my mind when I was reading non-metropolitan. But non-metropolitan is a
legal term in England and the rule above covers all ...-metropolitan words. Semi-metropolitan is a rare word. I am not sure if there are other words with -metropolitan. --
Diwas (
talk)
07:56, 6 July 2010 (UTC)
Non-metropolitan isn't a word I think I've ever heard, and semi-metropolitan is just as weird. I am a native speaker, and rural is not an antonym of metropolitan. This is the kind of example of what this project isn't appropriate for, although may be an appropriate fix in some cases.
Shadowjams (
talk)
08:00, 6 July 2010 (UTC)
Staus
Does this make sense? I want to replace "staus" with "status", but only when not capitalized (to avoid the surname). The misspelling seems to be very common. Thanks,
PleaseStand(talk)02:15, 6 July 2010 (UTC)
I knew that, so I have now added that rule. All or almost all occurrences of lowercase "staus" shown in a Wikipedia search should have been "status".
PleaseStand(talk)02:49, 6 July 2010 (UTC)
This appears to be affecting many articles and may be a legitimate spelling of the name since it's so prolific across Wikipedia.
Can we discuss this? Just want some reassurances that all these edits I'm doing won't have to be reverted. Not against it if it's right.--
mboverload@00:49, 28 June 2010 (UTC)
It appears to me that the intent of this rule is only to capitalize it and make it two words if it appears as one. Is it doing something else?--
BillFlis (
talk)
09:44, 28 June 2010 (UTC)
Being prolific across Wikipedia is not an indication of legitimacy. Unless there is a reliable source that indicates it should not be capitalized or should not be two words, you have at least my reassurances that those edits shouldn't be reverted. --
JHunterJ (
talk)
11:25, 28 June 2010 (UTC)
The rule has been restored. That's all it was doing. The reference desk said either could be accurate. Might as well stick with one. --
mboverload@06:21, 5 July 2010 (UTC)
I'd like to point out (in case it wasn't clear) that although the official name is indeed Tamil Nadu, correcting it is in many cases wrong. Specifically, as part of an organization's name, as we all agree organization names should not be "corrected" (my favorite example being
Childrens Hospital Los Angeles). As some/most people are not aware of this and might be tempted to "correct" such instances, and it is indeed very prolific, it might be best to be prudent and not include this rule. --
Muhandes (
talk)
08:44, 12 July 2010 (UTC)
I changed the "-keted" so it won't catch
racketts, but still catches bracketted. I hope I did it correctly, first time I try my hand at this. --
Muhandes (
talk)
11:00, 12 July 2010 (UTC)
Looking at the rule, it also captures the ending "s" and "ing", so in fact it catches "-keted", "-kets", "-keting". --
Muhandes (
talk)
12:07, 12 July 2010 (UTC)
I've noticed that the "publisher=" parameter of the cite templates is widely misused to specify the name of the newspaper or magazine; sometimes the person responsible realised that the name should be italicised, so they've manually added italics e.g. "publisher=''The Times''". Of course, the real problem is that it's the wrong parameter - what's really needed is "work=The Times". I've set up my own find and replace regex to correct anything of that form, specifying a long list of widely quoted newspapers and magazines. Could/should this be added to the list of automatic corrections somehow?
Colonies Chris (
talk)
16:24, 13 July 2010 (UTC)
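A find and replace of the kind described might look roughly like the Python sketch below. The actual regex and newspaper list are not given in this thread, so the short PERIODICALS list and the function name are placeholders.

```python
import re

# Hypothetical short list; a real rule would carry many more titles.
PERIODICALS = ["The Times", "The Guardian", "The New York Times"]

# Matches |publisher=''Title'' for listed titles only, tolerating spaces.
PUBLISHER = re.compile(
    r"\|\s*publisher\s*=\s*''("
    + "|".join(re.escape(title) for title in PERIODICALS)
    + r")''"
)


def publisher_to_work(wikitext):
    """Turn an italicised publisher= value for a known periodical into work=."""
    return PUBLISHER.sub(r"|work=\1", wikitext)
```

This sketch does not check whether the citation already has a work= parameter, so it would need a human check (or an extra condition) before being applied blindly.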
Was this
dispute settled? I thought there was still a discussion on it. My (very limited) understanding is that the website is the work, the publisher is the entity behind it, so isn't "publisher=The Times" correct? --
Muhandes (
talk)
18:22, 13 July 2010 (UTC)
|work= is the name of the publication/periodical/newspaper/website so is "The Times" for www.timesonline.co.uk etc. If |publisher= is used then it's the parent company of the website (perhaps Times Newspapers Limited or News Corporation in this case); publisher isn't used for well known publications as it's no extra use.
Rjwilmsi18:44, 13 July 2010 (UTC)
I believe we can do better than what we currently have. I was considering the above proposed regex to match "subsidiery" and its variants, but I don't want it to match "subseries".
PleaseStand(talk)04:13, 21 July 2010 (UTC)
It is a little messy though, and it's surprising how little it matches in terms of misspellings.
Currently, "nera" is fixed to "near".
Glyka Nera is a place in Greece and AWB suggested changing "Nera" to "Near". This of course is wrong and had it been in a large article full of suggested changes, I may not have noticed it. "Nera" with a capital N should not be corrected.
McLerristarr (Mclay1) (
talk)
12:18, 1 August 2010 (UTC)
The word "near" could begin a sentence, such as, "Near the opera house is the city hall." Some of these things just have to be tolerated -- not saying this is necessarily one, but just sayin'. --
Auntof6 (
talk)
17:00, 1 August 2010 (UTC)
Well, we can't possibly correct all typos. Someone could type "three" instead of "there", so that will never be corrected. It's better to be safe than sorry, we shouldn't rely on machines to do everything for us – good ol' copy-editing is always best. So in the case of three/there and Near/Nera, they should be left alone for people to find when reading. Perhaps "Nera" could only be left alone if it follows "Glyka". I don't know if that's possible.
McLerristarr (Mclay1) (
talk)
02:36, 2 August 2010 (UTC)
It's trivial to exclude the uppercase version, or to exclude "Glyka Nera" or similar constructions. Can the proper use of "Nera" be told apart from the typos by excluding the times it follows "Glyka"?
Shadowjams (
talk)
03:06, 2 August 2010 (UTC)
I'm fine with that. I doubt it's a common typo, and it's easily spotted by regular editing. I'll go half-way and change the rule to only correct non-capitalized versions. Someone else can remove it completely if that seems appropriate.
Shadowjams (
talk)
03:59, 2 August 2010 (UTC)
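The two options mentioned (lowercase only, or excluding "Glyka Nera") could be sketched in Python as follows. The real "nera" rule is not quoted in this thread, so both patterns are assumptions made for illustration.

```python
import re

# Option 1: lowercase only, so "Nera" (as in Glyka Nera) is never touched.
NERA_LOWER = re.compile(r"\bnera\b")

# Option 2: either case, but skipped when directly preceded by "Glyka ".
NERA_EXCEPT = re.compile(r"(?<!Glyka )\b(N|n)era\b")


def fix_lower(text):
    """Correct only lowercase 'nera'."""
    return NERA_LOWER.sub("near", text)


def fix_except(text):
    """Correct 'nera'/'Nera' except in 'Glyka Nera'."""
    return NERA_EXCEPT.sub(r"\1ear", text)
```

Option 2 still catches a sentence-initial "Nera", at the cost of one hard-coded exception that would need extending for other proper names.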
Can somebody please add "sapce" to change to "space" and "sapces" to change to "spaces". It is an easy typo to make and currently the typo exists in seven articles. In every case, it is a typo and not a foreign word.
McLerristarr (Mclay1) (
talk)
07:50, 2 August 2010 (UTC)
I noticed that AWB tries to change Enmei to Emmei in places such as "Enmei ryu" (a martial arts school) and "Enmei ji" (the name of a Buddhist temple). I always leave the page at Enmei because I have seen this spelling in various places online. But I have not been able to find a definitive answer as to which is correct. Also I notice that the family name "Ie" gets picked up and changed to "i.e.". So those using AWB on Japan-related pages need to take extra care before saving.
Colincbn (
talk)
06:26, 3 August 2010 (UTC)
That could work although it of course has to be case by case. Non-English words are forever an issue when trying to write a new rule.
Shadowjams (
talk)
08:10, 11 August 2010 (UTC)
A search turned up no instances of "canvern" on wikipedia. You must be doing a good job of correcting yourself!--
BillFlis (
talk)
17:59, 13 August 2010 (UTC)
Well, I usually edit with Safari, which has an automatic spell check so I usually notice when I make a mistake. I was thinking more for other editors' sake, but since the typo does not exist at the moment, it's probably not worth adding.
McLerristarr (Mclay1) (
talk)
07:28, 14 August 2010 (UTC)
i.e. and e.g.
"i.e" should be corrected to "i.e." ("e.g" already corrects to "e.g.")
a colon after "i.e.", "i.e", "ie", "e.g.", "e.g" or "eg" should be removed as it is completely unnecessary and yet common
Interesting. As to your first point, it took me a little bit to figure out why it's doing that because when I wrote the rule I did it largely to correct that problem. Whatever you're running it on that doesn't correct is a case where "i.e" is not followed by either a single quote, a space, a colon, a comma, a semi-colon, a close parenthesis mark, an ampersand (for non-breakable spaces, etc.), or a dash. Do you have an example of a page with that in the wild? It was somewhat intentional as a safety feature to not over-correct. Perhaps using \b would be sufficient, but the rule as it is now is very stable.
As to the second, I'd invite others to comment on that. I'm not enough of a style wonk to know the right answer to that.
Shadowjams (
talk)
08:05, 11 August 2010 (UTC)
Here's a proof of concept on the first point: perl -e '$x="i.e";$x=~s/\bi(?:\.?e|e\.)([\s,:;\)&-])(?<!\.ie.)/i.e.$1/;print "$x\n"' does not correct, while perl -e '$x="i.e ";$x=~s/\bi(?:\.?e|e\.)([\s,:;\)&-])(?<!\.ie.)/i.e.$1/;print "$x\n"' does.
Shadowjams (
talk)
08:06, 11 August 2010 (UTC)
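The same proof of concept can be run in Python, which makes it easier to try several inputs at once. This is a direct port of the pattern in the Perl one-liners above; fix_ie is an illustrative name.

```python
import re

# Same pattern as the Perl one-liners: a delimiter must follow, and the
# trailing lookbehind rejects matches that are really a ".ie" hostname.
IE = re.compile(r"\bi(?:\.?e|e\.)([\s,:;\)&-])(?<!\.ie.)")


def fix_ie(text):
    """Normalise 'ie' / 'i.e' / 'ie.' to 'i.e.' when followed by a delimiter."""
    return IE.sub(r"i.e.\1", text)
```

A bare "i.e" at the very end of the input is left alone because no delimiter follows it, and "downloadmusic.ie " is left alone because of the lookbehind.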
Right. It will correct "i.e " but not "i.e" It's rare if not non-existent in articles (i.e. supposes some text after it so it should have one of the demarcating characters; if it doesn't, it likely isn't the abbreviation).
Shadowjams (
talk)
08:33, 11 August 2010 (UTC)
I meant that it's so unlikely that it's not worth making a rule here for. An error in a far-fetched list like that is less likely than someone trying to type "
Ile" or "
ile" and accidentally hitting the period key for the "l".--
BillFlis (
talk)
13:45, 11 August 2010 (UTC)
If that were true, it would have to be in a list as well, or at the end of a paragraph that is missing a full stop. I just thought that making "i.e" always correct to "i.e." no matter what followed it would only require deleting the code that specified something followed it. I don't know though, I have no idea how this thing works. Either way, what's happening about the second point?
McLerristarr (Mclay1) (
talk)
03:16, 12 August 2010 (UTC)
It will work if it's followed by a space or any of these characters (in bold): ' : , ; ) & -. My reason for writing it this way was to avoid situations where ie might be used in some different, but correct way. I don't remember what exactly prompted that, maybe I found something testing or maybe I was being overly cautious. It's also important that rules don't catch correct versions of the words, and this helps with that, although you could do it other ways too.
Shadowjams (
talk)
19:27, 13 August 2010 (UTC)
Doesn't AWB already do that internally? I see what you're doing... you're adding spaces in those conversions. If you wanted to expand that rule though you could do: "(\b\d+)\s*m(etere?s)?(/| per | a )s(econd)\b" and replace it with "$1 m/s", although that's more expensive.
Shadowjams (
talk)
23:46, 10 August 2010 (UTC)
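A note on the expanded pattern as written: "s(econd)" makes "econd" mandatory, so plain "m/s" would not match, and "etere?s" does not cover "metres". The Python sketch below shows the pattern as proposed next to an adjusted version. The adjustment and the use of &nbsp; in the replacement are assumptions based on the stated goal of adding a non-breaking space, not the rule actually in use.

```python
import re

# As proposed above: "s(econd)" requires the full word "second".
PROPOSED = re.compile(r"(\b\d+)\s*m(etere?s)?(/| per | a )s(econd)\b")

# Adjusted sketch: "econd" is optional, and both "meters" and "metres" match.
ADJUSTED = re.compile(r"\b(\d+)\s*m(?:et(?:er|re)s?)?(?:/| per | a )s(?:econd)?\b")


def to_metres_per_second(text):
    """Rewrite '<number> m/s' style expressions with a non-breaking space."""
    return ADJUSTED.sub(r"\1&nbsp;m/s", text)
```

If only the symbol form should be covered, the word alternatives can simply be dropped from ADJUSTED.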
I've no idea why you'd want to clutter the regex that way, but I ain't the AWB guru, so what do I know. Use whatever works, I'll be happy with it. Also this should just cover the symbols, and not the words "metres/second", the point is to add the non-breaking space in before m/s.Headbomb {
talk /
contribs /
physics /
books}07:10, 11 August 2010 (UTC)
Is it entirely a good idea to correct the phrases at the bottom of the project page? If they were part of a quote, they would not need a sic tag since they are technically not incorrect. An editor may not notice they have corrected something that should not have been corrected. McLerristarr|Mclay123:40, 17 August 2010 (UTC)
When typo fixing all editors have to look out for untemplated quoted material. For such situations if there are problems {{sic}} can be used in hidden mode.
Rjwilmsi11:27, 18 August 2010 (UTC)
Fixing decent --> descent
This is a surprisingly common misspelling, in phrases like ".. he is of Asian decent .." , but obviously isn't suitable for a general typo fix. However, I think a regex to pick up anything of the form "of U(.*?)(an|ish|ic) decent". (where U represents an uppercase character) would find most of them without any false positives. My regex skills aren't up to it though - could someone more knowledgeable add this to the list?
Colonies Chris (
talk)
11:08, 18 August 2010 (UTC)
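The form described ("of", a capitalised word ending in -an, -ish or -ic, then "decent") could be written along these lines. This is a Python sketch for trying the idea out, with a hypothetical function name; it is not a tested typo rule.

```python
import re

# "of" + capitalised adjective ending in an/ish/ic + "decent".
DECENT = re.compile(r"\bof ([A-Z][a-z]*(?:an|ish|ic)) decent\b")


def fix_descent(text):
    """Change 'of <Nationality> decent' to 'of <Nationality> descent'."""
    return DECENT.sub(r"of \1 descent", text)
```

Hyphenated forms such as "African-American" are not covered by this sketch, since the adjective part only allows letters.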
So you're going to follow the guidance of a single person at wiktionary who says it's "considered more correct by most authorities" (without a reference to even a single "authority") instead of Merriam-Webster and the OED? Maybe you want to check back with that wiktionary person first.--
BillFlis (
talk)
01:48, 19 August 2010 (UTC)
The free Oxford online dictionary says "Propeller can also be spelled propellor: both are correct, but propeller is much more common." McLerristarr|Mclay111:09, 19 August 2010 (UTC)
masturbatch
Resolved
The "masturbate"-rule,<Typo word="Masturbate" find="\b(M|m)asterbat(\w+)\b" replace="$1asturbat$2" />
, tried to change
masterbatch to masturbatch. I found "masterbatch" on
five pages. Is that enough to add an exception? I'm not quite sure how to do that myself.--
ospalh (
talk)
11:57, 20 August 2010 (UTC)
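One possible way to add such an exception is a negative look-ahead right after "masterbat". The sketch below uses Python's re module purely for illustration; the (?!ch) exception is an assumption about how one might do it, not the change that was actually made to the list.

```python
import re

# Rule as quoted above, and a hypothetical variant that skips "masterbatch".
ORIGINAL = re.compile(r"\b(M|m)asterbat(\w+)\b")
WITH_EXCEPTION = re.compile(r"\b(M|m)asterbat(?!ch)(\w+)\b")
REPLACEMENT = r"\g<1>asturbat\g<2>"


def apply_rule(rule: re.Pattern, text: str) -> str:
    """Apply one typo rule to a piece of text."""
    return rule.sub(REPLACEMENT, text)


for word in ["masterbates", "Masterbating", "masterbatch", "masterbatches"]:
    print(word, "->", apply_rule(ORIGINAL, word), "|", apply_rule(WITH_EXCEPTION, word))
```

The look-ahead is only evaluated once "masterbat" has already matched, so it should add very little work on ordinary text.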
"<Typo word="Commemorate" find="\b(C|c)ommerat(es|ed|ing|ions?)\b" replace="$1ommemorat$2" />
": Is "commerates" &c. really the most common misspelling? I thought things like "comemorate" (one m before e) or "comemerate" (e instead of o) would be more common. "<Typo word="Commemorate" find="\b(C|c)om{1,2}e(?:mo|me)?rat(e|es|ed|ing|ions?)\b" replace="$1ommemorat$2" />" would find all of these, but would also change "comerates" to "commemorates". "Comerates" is a bit too close to "Comrades" for my taste. So, "<Typo word="Commemorate" find="\b(C|c)om{1,2}e(?:mo|me)rat(e|es|ed|ing|ions?)\b" replace="$1ommemorat$2" />" would fix "comemorate" and "commemerate", but not "comerates". Any thoughts?--
ospalh (
talk)
11:52, 25 August 2010 (UTC)
(Note to self: research before you type) Looks like a) "commerates" etc. is somewhat common, but b) there seems to be an actor called "
Sheridan Comerate", so 'find="\b(C|c)om{1,2}e(?:mo|me)?rat(e|es|ed|ing|ions?)\b"' would give some false positives and 'find="\b(C|c)om{1,2}e(?:mo|me)rat(e|es|ed|ing|ions?)\b"' would miss some misspellings.--
ospalh (
talk)
12:01, 25 August 2010 (UTC)
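The trade-off between the three candidate patterns can be checked quickly. The sketch below runs them in Python's re module rather than in AWB, and the rule labels and word list are chosen only for this illustration.

```python
import re

# The three find patterns quoted in the discussion above.
RULES = {
    "current": r"\b(C|c)ommerat(es|ed|ing|ions?)\b",
    "optional_syllable": r"\b(C|c)om{1,2}e(?:mo|me)?rat(e|es|ed|ing|ions?)\b",
    "required_syllable": r"\b(C|c)om{1,2}e(?:mo|me)rat(e|es|ed|ing|ions?)\b",
}

# Misspellings mentioned above, plus the surname that causes a false positive.
WORDS = ["commerates", "comemorate", "commemerate", "comerates", "Comerate"]


def hits(rule_name: str) -> list:
    """Return the sample words that the named rule matches."""
    rule = re.compile(RULES[rule_name])
    return [word for word in WORDS if rule.search(word)]


for name in RULES:
    print(name, hits(name))
```

The optional-syllable version catches everything, including the surname, while the required-syllable version avoids the surname but no longer catches "commerates".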
Is this worth it? The most common matches seem to be "most earliest", "most holiest", and "most costliest" (not necessarily in that order).
PleaseStand(talk)19:15, 25 August 2010 (UTC)
Ah, I see. I thought you meant is it worth keeping, as in you wanted to delete it. My mistake. One of the many problems of communicating by text. McLerristarr /
Mclay109:28, 27 August 2010 (UTC)
It's not always going to work as intended: When "Most" is capitalized, the adjective after correction will not be (will remain as it was). I would leave out the "M"; the error will probably be preceded by "the" anyway.--
BillFlis (
talk)
13:10, 27 August 2010 (UTC)
I'll let others opine on if there's some risk of a false positive, but this should do it: <Typo word="Melbourne" find="\b(M|m)elbo(rn|unr)e\b" replace="Melbourne" />. It should catch "Melborne" and "Melbounre" and will capitalize any lower case versions.
Shadowjams (
talk)
07:01, 24 August 2010 (UTC)
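A quick check of that pattern in Python's re module, assuming the replacement is meant to be the literal word "Melbourne" (which is what gives the capitalising behaviour described). The sample words are only for illustration.

```python
import re

# Find pattern from the post above; the literal replacement is an assumption.
MELBOURNE = re.compile(r"\b(M|m)elbo(rn|unr)e\b")


def fix_melbourne(text: str) -> str:
    """Correct the two misspellings and force the capital M."""
    return MELBOURNE.sub("Melbourne", text)


for word in ["Melborne", "melbounre", "Melbourne"]:
    print(word, "->", fix_melbourne(word))
```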
Could someone please add the plural form of
Phenomenon? It should be "Phenomena" but a very common misspelling is "Phenonema", with only two letters, the n and the m, switched around, making it very hard to spot. There's also a fairly large number of
search results in Wikipedia for this misspelling. I checked the current entry for "Phenomenon" in the list, and I do believe it does not take into account this particular misspelling of the plural form. --
Ϫ23:14, 29 August 2010 (UTC)
For a feature request I added the capability for AWB to hide text in italics as part of its HideMore() function ('Ignore templates, refs, link targets...'). Do we want hiding of italics on or off for typos? We already hide untemplated quotes (text between " and related curly quotes).
Rjwilmsi09:01, 30 August 2010 (UTC)
Sometimes we use italics to emphasise a word or a sentence. Italics are used for many reasons. Typo fixing should apply inside italics exactly the same way it applies outside them. --
Magioladitis (
talk)
09:03, 30 August 2010 (UTC)
Was the original concern over foreign and proper terms (like book/movie titles) or is there something else I'm not thinking of?
Shadowjams (
talk)
18:23, 30 August 2010 (UTC)
I see. I tend to agree with Magioladitis on this point, there're a lot of these that fit within typo territory, but perhaps it cuts down on false positives. Just something to be aware of, it's obviously not an ideological issue.
Shadowjams (
talk)
08:51, 31 August 2010 (UTC)
The guideline is laid out here:
Wikipedia:APOSTROPHE#Possessives. If you pronounce "series'[s] antagonist" as "sireez antagonist", then Wikipedia says not to use the additional s. On the other hand, it says if there are two possible pronunciations, you can use either. I definitely pronounce the phrase "series's antagonist" as "sireeziz antagonist". — the Man in Question(in question)17:07, 5 September 2010 (UTC)
I've removed the rule. Per the guidelines on apostrophes, both versions are potentially correct, as long as usage is consistent (with the 's, without the 's, or with the 's if pronounced as iz) on a given article. --
JHunterJ (
talk)
11:29, 6 September 2010 (UTC)
I can't find the rule that would make such a change, and I can't find any instances of "heaively" (or "heaivly", which seems more likely) in wikipedia. It looks like it's no longer a problem.--
BillFlis (
talk)
11:19, 13 September 2010 (UTC)
Either bender's original post has a typo, or it's replacing "heaively" with itself, and I can't find a rule that would do that either. Perhaps you meant it was replacing "heavily" with "heaiviley", which would make sense given this rule: <Typo word="-ively" find="\b(\w+)ivly\b" replace="$1ively" />. Before changing that, beware that "ively" is an equally common, if not more common, version of that ending. Anyone have ideas about how to distinguish which ending is right based on the base?
Shadowjams (
talk)
17:40, 13 September 2010 (UTC)
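To see what the "-ively" rule does to different stems, here is a small sketch in Python's re module. The sample words are chosen for illustration only.

```python
import re

# The "-ively" rule quoted above, with $1 written as \g<1> for Python.
IVELY = re.compile(r"\b(\w+)ivly\b")


def fix_ively(text: str) -> str:
    """Insert the missing 'e' in words ending in 'ivly'."""
    return IVELY.sub(r"\g<1>ively", text)


for word in ["effectivly", "heaivly", "heavily"]:
    print(word, "->", fix_ively(word))
```

The rule cannot tell a dropped "e" (effectivly) from transposed letters (heaivly), so the second case comes out as "heaively" rather than "heavily".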
Alternation vs. character classes
Hall with Schwartz, in Effective Perl Programming, call using alternation (A|a) instead of a character class [Aa] a "classic mistake" and say that it carries a speed penalty, perhaps on the order of 4x. Maybe the processing here has gotten smarter since then, and alternation does save characters when capturing, (A|a) instead of ([Aa]), but we may still want to change it back. --
JHunterJ (
talk)
19:25, 13 September 2010 (UTC)
ISBN 0596528124, page 237, has a benchmark for .NET that lists character classes as being 4.7x faster. I don't know how old that is... but worth considering. There are probably other optimizations like this as well.
Shadowjams (
talk)
00:40, 14 September 2010 (UTC)
VB.NET, we use C#: I profiled 1000 replace operations for "\b(R|r)ec(?:ie|ei?)pient(s?)\b" and "\b([Rr])ec(?:ie|ei?)pient(s?)\b" (details on request) and the numbers were 13463 and 12860 ms respectively i.e. around a 5% difference only. So I conclude there's not much difference for C#. We cannot take a 4x or 5x difference in another language and assume it applies for ours.
Rjwilmsi20:54, 14 September 2010 (UTC)
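The same kind of comparison is easy to repeat in other engines. The sketch below times the two patterns with Python's timeit; Python's regex engine is not the .NET one, so whatever numbers it prints say nothing about AWB. The sample text and repeat counts are arbitrary choices for illustration.

```python
import re
import timeit

# The two patterns profiled above, with $1/$2 written as \g<1>/\g<2>.
ALTERNATION = re.compile(r"\b(R|r)ec(?:ie|ei?)pient(s?)\b")
CHAR_CLASS = re.compile(r"\b([Rr])ec(?:ie|ei?)pient(s?)\b")
REPLACEMENT = r"\g<1>ecipient\g<2>"

# Arbitrary sample text containing two misspellings, repeated to give the engine some work.
TEXT = "The reciepient thanked the other Receipients. " * 200


def run(pattern: re.Pattern) -> str:
    """Apply the replacement once over the whole sample text."""
    return pattern.sub(REPLACEMENT, TEXT)


# Both forms must produce identical output before timing means anything.
assert run(ALTERNATION) == run(CHAR_CLASS)

for name, pattern in [("(R|r)", ALTERNATION), ("([Rr])", CHAR_CLASS)]:
    seconds = timeit.timeit(lambda: run(pattern), number=50)
    print(f"{name}: {seconds:.4f} s for 50 runs")
```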
Wow that is common. Added a rule
here. I looked around in a few dictionaries thinking it might be an alternative spelling just based on how common it is, but I couldn't find anything. Done. Shadowjams (
talk)
15:28, 14 September 2010 (UTC)
This edit correctly changed "67 Kg" and "800 Km" to "67 kg" and "800 km". However, the edit summary reads (
Typo fixing, typos fixed: 7 Kg → 7 kg (2) using
AWB).
Anyone want to try updating the
rule to make the edit summary better? Thanks!
GoingBatty (
talk)
04:49, 14 September 2010 (UTC)
One could make the summary more accurate by putting a quantifier (+ in this case) on the \d in the rule, but that would increase (albeit infinitesimally) the time the regex takes on every page scanned. It probably doesn't matter either way; if you want to put it in there that's how one would do it.
Shadowjams (
talk)
05:48, 14 September 2010 (UTC)
Actually, on second look, that's not a Typo rule, that's a built-in program rule. I'm guessing that internal rule uses regex too though, so the same applies.
Shadowjams (
talk)
05:51, 14 September 2010 (UTC)
No, it is a typo issue. My second point was wrong (Rjwilmsi was correcting me). I was confused because I was looking for a rule that would add &nbsp; to the output, and there isn't a rule that did that (that part is internal). However, there is a rule that did the capitalization, and updating that would fix the OP's issue. It's this one: <Typo word="kg/km (kilogram/kilometer)" find="(\d(?:\s|&nbsp;|-)?)K(g|m)\b" replace="$1k$2" />.
Change it to <Typo word="kg/km (kilogram/kilometer)" find="(\d+(?:\s|&nbsp;|-)?)K(g|m)\b" replace="$1k$2" /> and you've fixed the issue (see above for speed considerations).
Shadowjams (
talk)
16:45, 14 September 2010 (UTC)
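The effect of the + on the edit summary can be seen by looking at what each version of the rule actually matches. This is a sketch in Python's re module, assuming the middle alternative in the rule is the &nbsp; entity; the summary format is a simplified stand-in for what AWB writes.

```python
import re

# Rule before and after adding "+" to \d (middle alternative assumed to be &nbsp;).
OLD_RULE = re.compile(r"(\d(?:\s|&nbsp;|-)?)K(g|m)\b")
NEW_RULE = re.compile(r"(\d+(?:\s|&nbsp;|-)?)K(g|m)\b")
REPLACEMENT = r"\g<1>k\g<2>"


def summary_entries(rule: re.Pattern, text: str) -> list:
    """List 'matched text -> replacement' pairs, roughly as an edit summary would."""
    return [f"{m.group(0)} -> {m.expand(REPLACEMENT)}" for m in rule.finditer(text)]


text = "It weighs 67 Kg and has a range of 800 Km."
print(summary_entries(OLD_RULE, text))
print(summary_entries(NEW_RULE, text))
```

Both versions change the article text identically; only the reported match differs, because a single \d captures just the last digit before the unit.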
All of the rules have been updated with the +. Now I see in
this edit that AWB accurately changed "16KHZ" → "16 kHz", but the edit summary says: (Typo fixing, typos fixed: 16KHZ → 16kHz using AWB) (without the space)
GoingBatty (
talk)
03:27, 17 September 2010 (UTC)
Another very common misspelling (over 2000 search results!) Including supressed/supressing/supression and whatever other prefixes there are. I'm surprised this one wasn't in there already..
Actually I did find "(Immuno)Suppress" in the list, but that doesn't seem correct.. it's already got the double-p, so maybe that's just a mistake? or what, but I don't know if maybe the (Immuno) part is affecting the detection somehow too.
Opress --> Oppress is another one we could add, that one is a bit less common but still coming up in search results. Except that the search results come up with the false positive "of-press" for some reason, which is slightly annoying, but I don't think that would affect AWB's typo detection anyway. --
Ϫ22:50, 15 September 2010 (UTC)
Oh! okay. These regexes still confuse me. :) But, is it normal for there to still be so many
existing misspellings? I thought that once a typo gets added to the list they usually all get fixed pretty quickly.. Is it just that no one has patrolled these articles yet with AWB? --
Ϫ17:05, 16 September 2010 (UTC)
Inconsistent use of formats such as '(C|c)' and '[Cc]'. Propose change all to '[Cc]'
The list is inconsistent in whether the regex uses '(C|c)' or '[Cc]'. I propose changing them all to the format '[Cc]'. It's trivial, but using the same format makes it slightly easier to notice the real differences. Any objections?
Lightmouse (
talk)
15:15, 17 September 2010 (UTC)
They are not equivalent. "(C|c)" is equivalent to "([Cc])". Also, I know there was some discussion about speed, but a more important consideration might be space. This page is already huge, and changing every instance of this would add another character to each of the affected rules, which is the large majority of them.--
BillFlis (
talk)
18:54, 17 September 2010 (UTC)
You're quite right, the pairings are '(C|c)' with '([Cc])', or '(?:C|c)' with '[Cc]'. I agree that compact code is a good thing. I'll leave it to you. Incidentally, I'm sure there are more units of measure that would be useful, also I only see one square unit of length and there could be cubes too.
Lightmouse (
talk)
20:23, 17 September 2010 (UTC)
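The pairings can be confirmed by counting capture groups. The sketch below uses Python's re module, and the "calender" rule in it is a made-up example rather than an entry from the list.

```python
import re

# Hypothetical "calender" -> "calendar" rule written four ways.
CAPTURING = [r"\b(C|c)alender\b", r"\b([Cc])alender\b"]
NON_CAPTURING = [r"\b(?:C|c)alender\b", r"\b[Cc]alender\b"]


def group_count(pattern: str) -> int:
    """Number of capture groups the pattern defines."""
    return re.compile(pattern).groups


def fix(pattern: str, text: str) -> str:
    """Apply a $1-style replacement; only valid for the capturing forms."""
    return re.sub(pattern, r"\g<1>alendar", text)


for pattern in CAPTURING + NON_CAPTURING:
    print(pattern, "groups:", group_count(pattern))
print(fix(CAPTURING[0], "Calender"), fix(CAPTURING[1], "calender"))
```

Only the capturing forms provide the $1 that the replacement strings in the list rely on, which is why '(C|c)' cannot simply be swapped for '[Cc]'.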
Bill sums up the issue exactly. I can see positives to both. In some ways I think ([Cc]) is conceptually clearer, but that's a personal preference. I made the changes to all of the New additions thinking the speed tradeoff was more important than later testing demonstrated. There is 1 character difference between the two; I don't see any reason to prefer one over the other. I think it's best to leave them as they're originally created, with whatever idiom the creator chooses.
Shadowjams (
talk)
21:58, 17 September 2010 (UTC)
I took another look. What it's doing is looking for an "Etc" followed by something that's neither a period nor a word character (0-9, a-z). In the case of "etc....." it skips the match because there's already a period, and doesn't look at the rest. This is intentional for two reasons. First, it terminates the search early on correct matches (which are the majority) and saves processing time. Second, it allows for unanticipated but correct uses, like an ellipsis. Its not fixing a bare "etc" is related: because there's nothing following the c, the rule doesn't match. However, in a real article etc won't be alone. It will be followed by something: "etc more words". This sometimes comes up in testing. We try to design rules so they don't catch on correct spellings (even if they correct them back to themselves) because I assume they take more processing (they run entirely, as opposed to stopping midway through). Maybe that's unnecessary, but most of the rules adhere to that format.
Shadowjams (
talk)
22:10, 17 September 2010 (UTC)
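The actual "etc." rule is not quoted in this thread, so the pattern below is only a guess at a rule with the behaviour described: it needs one following character that is neither a period nor a word character. It is written for Python's re module purely as an illustration.

```python
import re

# Hypothetical stand-in for the "etc." rule; NOT the rule from the typo list.
# It requires one following character that is neither "." nor a word character.
ETC_RULE = re.compile(r"\b([Ee])tc([^.\w])")


def fix_etc(text: str) -> str:
    """Add the missing period after 'etc' when something other than '.' follows."""
    return ETC_RULE.sub(r"\g<1>tc.\g<2>", text)


for sample in ["apples, pears, etc and more", "etc.....", "etc", "etcetera"]:
    print(repr(sample), "->", repr(fix_etc(sample)))
```

Because the pattern consumes a following character, a bare "etc" at the very end of the text never matches, and "etc" already followed by a period is rejected as soon as the period is seen.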
I appreciate your reply. I made this request because I thought that "etc." plus an ellipsis was not a correct use. Why would an ellipsis be necessary? Thanks!
GoingBatty (
talk)
15:26, 19 September 2010 (UTC)
That's a good point. I tended towards the cautious with some of these when I started, and early on I added the etc. rule that's currently in use (although there was a simpler one before it). I think the change you're talking about would be fine.
Shadowjams (
talk)
05:12, 20 September 2010 (UTC)
Thanks Shadowjams. I was playing around with how to edit the rule to fix "etc....", but couldn't get it to skip "etc." Could you please help me with this? Thanks!
GoingBatty (
talk)
17:07, 20 September 2010 (UTC)
There's another problem with that though. The - needs to be at the end of the class, otherwise it's looking for a range. I'm not sure what it does in that case, but it might explain any strange effects you're seeing.
Shadowjams (
talk)
21:02, 19 September 2010 (UTC)
Interesting. That's actually a little new... it doesn't work with
grep for instance. Perl calls this
version 8 regex (I think). Apparently - at either the beginning or end is fine, but in the middle, of course, it's ambiguous.
Shadowjams (
talk)
06:17, 20 September 2010 (UTC)
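The hyphen behaviour is easy to check in any given engine. The sketch below does so for Python's re module; the .NET engine that AWB uses is believed to treat a leading or trailing hyphen the same way, but that is not verified here.

```python
import re

# A hyphen between two characters makes a range; at either edge it is literal.
CLASSES = {
    "range": r"[a-c]",
    "hyphen_last": r"[ac-]",
    "hyphen_first": r"[-ac]",
    "hyphen_escaped": r"[a\-c]",
}


def accepts(class_name: str, char: str) -> bool:
    """True if the single character is matched by the named character class."""
    return re.fullmatch(CLASSES[class_name], char) is not None


for name in CLASSES:
    print(name, {char: accepts(name, char) for char in "ab-"})
```

Escaping the hyphen, as in the last entry, removes the ambiguity when it has to sit in the middle of a class.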
Although there's an existing rule for "Hungary" that includes "Hungarian", it doesn't want to fix "hungarian" and "hungarians" in
Culture of Hungary. When I tried the rule in the AWB Regex tester, it seems to work fine. Any ideas?
GoingBatty (
talk)
04:22, 20 September 2010 (UTC)
Typo fixing rules are not applied when a wikilink target also matches on the typo rule in order to avoid false positives on uncommon names etc. In this case there's an image linked in the article with a lowercase 'hungarian' in the file name, hence the typo fix is not applied. From looking at the Commons:File Renaming page it would appear that asking for the file to be renamed might be refused. I've now applied the typo fixing to the article. Feel free to try to get the image renamed.
Rjwilmsi16:29, 23 September 2010 (UTC)
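The safeguard described above can be pictured roughly as follows. This is a much simplified sketch in Python, not AWB's actual code: the link-target pattern, the function name and the lowercase "hungarian" rule are all assumptions made for illustration.

```python
import re

# Simplified: grab the target part of [[Target]] or [[Target|label]] links.
LINK_TARGET = re.compile(r"\[\[([^\]|]+)")

# Hypothetical rule for illustration; the real "Hungary" rule is not quoted here.
HUNGARIAN_FIND = r"\bhungarian(s?)\b"
HUNGARIAN_REPLACE = r"Hungarian\g<1>"


def apply_rule(text: str, find: str, replace: str) -> str:
    """Apply a typo rule unless some wikilink target also matches it."""
    rule = re.compile(find)
    if any(rule.search(target) for target in LINK_TARGET.findall(text)):
        return text  # a link or file name matches, so the rule is skipped for the page
    return rule.sub(replace, text)


with_image = "[[File:Old hungarian map.jpg|thumb]] The hungarians settled here."
without_image = "The hungarians settled here."
print(apply_rule(with_image, HUNGARIAN_FIND, HUNGARIAN_REPLACE))
print(apply_rule(without_image, HUNGARIAN_FIND, HUNGARIAN_REPLACE))
```

In this sketch a single matching file name is enough to switch the rule off for the whole page, which mirrors the behaviour seen on Culture of Hungary.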
I limited the rule for "Critical", which was evidently making this change, to not make this particular change. We'll need a new rule to correct "critiziced" to "criticized", which I was surprised to find has more than a dozen occurrences on wikipedia.--
BillFlis (
talk)
16:22, 23 September 2010 (UTC)