User:Legoktm/Fixing lint errors

Back in 2016 I was a member of the WMF Parsing Team and wrote the initial MediaWiki side of Linter (the initial Parsoid side had been written two years earlier by Hardik95 in a GSoC project). It was a significant advance in allowing editors to clean up wikitext and move the language forward. At the time, the main motivation was to replace Tidy, which is now complete. I no longer work at the WMF, but I am a {{friend}} of the Content Transform Team (formerly the Parsing Team) and believe the work they do is some of the most beneficial to users.

Mission accomplished?

We replaced Tidy and the sky didn't fall; are we good now?

In my opinion, Linter has not yet reached its full potential, largely because we have an existing backlog of issues to fix first. In general, we allow people to make mistakes while editing and then track those mistakes in categories or database reports for cleanup, whether that's using a template incorrectly or something more technical, like nesting a <div> inside a <span>.

Some templates emit visible (and hidden) warnings, but they're easy to miss since they aren't integrated natively into most editors. And most users wouldn't notice that they've misnested some HTML tags, because the page displays just fine despite being wrong.

Other templates and markup patterns render text incorrectly: {{talk quote inline}} breaks when it contains multiple paragraphs of text, {{reflist-talk}} breaks when it is indented, and editors put six apostrophes in a row in a table cell, intending to leave a blank space for bold text to be filled in later. When a <font> tag is used to wrap a wikilink, the editor's intended color is not applied. An unclosed <div> can render the rest of a talk page's sections inside a blue box that should have ended with its own section. In extreme cases, invalid markup can make the rest of a page's text smaller, or green, or struck out.

Linter provides a framework for documenting where issues are in wikitext, making it easier for tools to highlight them for fixing. It does not (yet) allow on-wiki editors to define their own lint checks, but I think that would be a good future step.
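
For illustration, here's roughly what querying that framework looks like through the Action API's list=linterrors module (provided by the Linter extension). This is a minimal Python sketch; the parameter names and response fields are written from memory and should be treated as assumptions, so check the live API documentation before relying on them:

  import requests

  API = "https://en.wikipedia.org/w/api.php"

  # Ask Linter for recorded errors in one category; "obsolete-tag" is one
  # of its standard categories.
  resp = requests.get(API, params={
      "action": "query",
      "list": "linterrors",
      "lntcategories": "obsolete-tag",
      "lntlimit": 10,
      "format": "json",
  }).json()

  for err in resp["query"]["linterrors"]:
      # Each record names the page and (assumed field) where in the
      # wikitext the offending markup sits, so tools can highlight it.
      print(err["title"], err.get("location"))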

What should check for lint errors?

Just about everything. To aid DiscussionTools, new requirements were instituted that forbade signatures from having most lint errors (obsolete tags remain allowable signature markup). Bots and other automated processes should check that they aren't introducing new lint errors before saving pages. It sure would be nice if we never had to clean up mass messages that didn't close tags properly. Editing interfaces should flag new lint errors to experienced editors so they can fix them before saving. (To be clear, I don't think we should stop human editors from making edits that introduce new errors, just that we should flag the errors and give editors the option to fix them.)
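
As a sketch of what a bot-side check could look like: Parsoid can lint arbitrary wikitext, and the Wikimedia REST API has exposed this as a transform endpoint. The endpoint path and response shape below are assumptions modeled on the RESTBase transform API and may have changed; the point is just "lint before save":

  import requests

  API = "https://en.wikipedia.org/api/rest_v1"

  def lint(title: str, wikitext: str) -> list:
      # Ask Parsoid to lint the proposed text without saving anything.
      # (Assumed endpoint; verify against the current REST API docs.)
      resp = requests.post(
          f"{API}/transform/wikitext/to/lint/{title}",
          json={"wikitext": wikitext},
      )
      resp.raise_for_status()
      return resp.json()  # assumed: a list of error records with a "type"

  def save_if_clean(title, old_text, new_text, save):
      # Comparing by error type alone is crude but enough for a sketch;
      # a real bot would compare positions too.
      before = {e["type"] for e in lint(title, old_text)}
      new_errors = [e for e in lint(title, new_text) if e["type"] not in before]
      if new_errors:
          raise RuntimeError(f"edit would introduce lint errors: {new_errors}")
      save(title, new_text)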

But it's hard to make this a requirement yet because we have so many outstanding issues. Exposing this to editors or stopping bots from editing if they hit errors simply isn't possible yet because there's too much noise. We need to drastically reduce the number of lint errors before such an option is realistic.

But obsolete HTML tags

If we ignored obsolete HTML tags, yes, the backlog would be much smaller, probably half of the current size.

I have previously noted that it's unlikely browsers will ever drop these obsolete elements, and that if they do, we'll have a good amount of notice from the major browser maintainers. And even if browsers did drop support for the tags, we could re-implement them in MediaWiki if we wanted to.

There is value gained from fixing obsolete HTML tags. In no specific order:

  • It aligns us with HTML5 efforts in general, just like our move away from Tidy.
  • The obsolete tags are obsolete for a reason: they have really weird behavior. For example, <font color="coal"> turns into #c0a000 (c and a are the only hex characters), which is not coal-colored at all; see the sketch after this list for how that happens. By comparison, <span style="color: coal;"> doesn't do anything, because coal is an invalid color.
    • <center> will center blocks and text, unless it's in a table, in which case it doesn't center text.
    • And so on. Searching the internet turns up other reasons, such as accessibility, for why these tags were deprecated.
  • People tend to copy and paste things they find in other wiki pages (I certainly do!). Fixing things will prevent people from introducing new issues by copy-paste.
  • While browsers already support legacy tags and will continue to, it's much simpler for people to develop tools if they don't need to implement every type of legacy behavior.
    • Imagine a tool that automatically checked pages' use of colors for appropriate contrast (similar to {{Ensure AAA contrast ratio}}). Such a tool would only need to look at standard CSS/inline styles rather than implementing legacy <font> behavior.
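
To make the <font color="coal"> example above concrete, here is a simplified Python sketch of the WHATWG "rules for parsing a legacy colour value" that browsers apply to that attribute. It skips named colours and several edge cases (very long values, leading-zero trimming), but it reproduces the #c0a000 result:

  def legacy_color(value: str) -> str:
      s = value.strip().lstrip("#")
      # Every character that isn't a hex digit becomes "0"; in "coal",
      # only "c" and "a" survive.
      s = "".join(c if c.lower() in "0123456789abcdef" else "0" for c in s)
      # Zero-pad until the length is a non-zero multiple of 3.
      while len(s) == 0 or len(s) % 3 != 0:
          s += "0"
      # Split into three equal parts and keep the first two digits of each.
      third = len(s) // 3
      r, g, b = s[:third], s[third:2 * third], s[2 * third:]
      return "#" + r[:2] + g[:2] + b[:2]

  print(legacy_color("coal"))  # #c0a000, which is not coal-colored at all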

OK, but is that worth making edits to a bunch of pages that are purely archival, which no one really cares about or will ever look at again? Sure. I don't see this as any different from updating deprecated template parameters or merging duplicate templates. We could easily continue to support those, but they add a maintenance cost, so we clean them up. Obsolete HTML tags are roughly the same. We don't absolutely *have* to fix them, but we do it to simplify things.

Because this cleanup has a much larger scope than templates, requiring edits to millions of pages, it will likely cause more disruption for editors in their watchlists and other patrolling tools. I expect these notifications will decrease over time as the error backlog is cleared, as people get used to watching pages temporarily rather than indefinitely, and as we get better at using change tags to mark specific types of edits that editors can choose to hide.

Enabling a better paradigm of bots and tools

Most bots and tools operate by parsing and manipulating wikitext. Many use regexes, and others use hacky parsers (e.g. mwparserfromhell); as a result, they don't benefit from the advances made in Parsoid over the past decade. For example, mwparserfromhell has code to guess the whitespace formatting of templates so it can match the existing spacing when changing parameters. We don't really need to do that anymore: templates now have TemplateData that specifies how a template should be spaced, and using Parsoid handles that for you seamlessly. There's no need to worry that wikitext is context-sensitive, because you can just examine and manipulate the HTML DOM and let Parsoid take care of it all.
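
To give a feel for the DOM-based approach, here's a rough sketch that edits a template parameter by modifying Parsoid's HTML instead of the wikitext. Parsoid really does mark transclusions with typeof="mw:Transclusion" and store the template call as JSON in a data-mw attribute; the REST endpoints, page title, and template/parameter names here are illustrative assumptions:

  import json
  import requests
  from bs4 import BeautifulSoup

  API = "https://en.wikipedia.org/api/rest_v1"
  title = "Example"  # hypothetical page

  # Fetch Parsoid's annotated HTML for the page.
  html = requests.get(f"{API}/page/html/{title}").text
  doc = BeautifulSoup(html, "html.parser")

  # Transclusions carry their template call as JSON in data-mw; edit the
  # JSON instead of pattern-matching the wikitext.
  for node in doc.select('[typeof~="mw:Transclusion"]'):
      data = json.loads(node["data-mw"])
      for part in data.get("parts", []):
          tpl = part.get("template") if isinstance(part, dict) else None
          if tpl and tpl["target"].get("wt", "").strip().lower() == "reflist":
              tpl["params"]["colwidth"] = {"wt": "30em"}  # hypothetical edit
              node["data-mw"] = json.dumps(data)

  # Parsoid serializes the modified DOM back to wikitext, applying
  # TemplateData's formatting rules so the spacing comes out right.
  wikitext = requests.post(
      f"{API}/transform/html/to/wikitext/{title}",
      json={"html": str(doc)},
  ).text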

However, operating on the HTML DOM only works if the wikitext's conflicts with well-formed HTML are fixed, so we do need to clean up these lint errors. And of course, doing so also helps regular editors who use VisualEditor, editProtectedHelper, and more.

There is no shortage of tasks that need attention from bot authors, so advances that simplify the process of creating and maintaining bots will be huge wins in the long run.