The HTML used by PDF.js is in no way a sane "human readable structured data" format. It is a "browser compatible" ad-hoc representation of the underlying PDF.
Lastly, PDFs are not a format for structured data, which is what XML is all about in the end.
So I have no idea why you think HTML+JS is a sane replacement for XML use cases; it isn't. It falls so far short of being a replacement that IMHO the comparison doesn't even really make sense (the JS part).
XML is a defective technology that has no place in the modern web. Perhaps the only successful bit is SVG, for lack of a better alternative. MathML is a failure; KaTeX and MathJax do a fundamentally better job of rendering mathematics on the web and are based on what people who write a lot of math actually use: TeX and friends.
If you need to interpret XML documents as HTML, use some JavaScript. The attack surface reduction from eliminating XPath, XHTML, XSLT and other mistakes like microformats is worth it on its own.
I stand by this, having implemented enough of XPath, XML 1.1 and XSLT to implement WS-Security from scratch (have fun with c14n!).
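For anyone wondering why c14n gets singled out: an XML signature is computed over bytes, and the same element can be serialized in many byte-different ways, so both sides have to agree on one canonical form before hashing. Below is a minimal sketch of that idea using Python's standard library, which implements C14N 2.0. WS-Security deployments typically use Exclusive XML Canonicalization instead, which this does not implement, and the `Body` element and `digest` helper are made up for illustration.

```python
import hashlib
from xml.etree.ElementTree import canonicalize  # available since Python 3.8

# Two serializations of the same hypothetical element: different attribute
# order, quoting, whitespace, and empty-element syntax.
doc_a = '<Body b="2"   a="1"/>'
doc_b = "<Body a='1' b='2'></Body>"


def sha256_hex(text: str) -> str:
    """Hex SHA-256 digest of the UTF-8 encoding of text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def digest(xml_text: str) -> str:
    """Digest of the C14N 2.0 form of xml_text (illustrative helper)."""
    return sha256_hex(canonicalize(xml_text))


# Hashing the raw text disagrees; hashing the canonical form agrees.
raw_equal = sha256_hex(doc_a) == sha256_hex(doc_b)
canonical_equal = digest(doc_a) == digest(doc_b)
print("raw digests equal:      ", raw_equal)
print("canonical digests equal:", canonical_equal)
```

This toy case is the easy part. The pain in practice tends to come from namespaces, inherited declarations, and signing a fragment that is later moved into a different envelope.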
The sooner we move on from the failed experiment of XHTML, the better. The idea that the browser is the means of extending the core document model is gone; most if not all power resides in the JS engine. If it makes sense to put something in the core browser engine, usage statistics will make that obvious.
You can compile libxml2 to wasm if you must (I've done this when I needed a more complete XPath implementation).
XML is rather successful outside of the web in spite of all the vitriol poured on it. And the modern web is under too specific and unique a combination of pressures to serve as a reference for the rest of software development. If you are looking for something that deserves the name "defective", you don't need to look any further.
For example, the problem with math is not only rendering it in the browser. At the very least we may want to render it on the server and to index it. And since math expressions are part of a document, there's a general need to process them programmatically for a variety of purposes, some of which are not even clear at the moment. With the KaTeX or MathJax solution, each such scenario would have to include KaTeX, MathJax, or a custom parser for the underlying TeX-like language, whose only upside is that it's somewhat well known and more or less easy to write. With MathML, these and other scenarios can be handled with the standard XML toolchain. (And this doesn't mean we need to exclude that neat TeX-like language if we want to use it for input: we only need to add a step that transforms it into MathML.) MathML is, of course, not simple, but it addresses both the presentational and semantic sides of a formula, something that no other solution does. It's complex because the math is complex.
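To make the "standard XML toolchain" point concrete, here is a minimal sketch of the indexing scenario using nothing but Python's standard library. It assumes presentation MathML in the usual MathML namespace; the sample formula and the `index_terms` helper are invented for illustration and are not taken from any real indexer.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Hypothetical input: a^2 + b^2 = c^2 as presentation MathML.
SAMPLE = """<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mrow>
    <msup><mi>a</mi><mn>2</mn></msup>
    <mo>+</mo>
    <msup><mi>b</mi><mn>2</mn></msup>
    <mo>=</mo>
    <msup><mi>c</mi><mn>2</mn></msup>
  </mrow>
</math>"""


def index_terms(mathml: str) -> Dict[str, List[str]]:
    """Collect identifiers, numbers and operators in document order."""
    root = ET.fromstring(mathml)
    ns = {"m": MATHML_NS}

    def texts(tag: str) -> List[str]:
        # ElementTree supports a limited XPath subset; ".//m:tag" finds
        # all descendants with that tag in the MathML namespace.
        return [el.text or "" for el in root.findall(f".//m:{tag}", ns)]

    return {
        "identifiers": texts("mi"),
        "numbers": texts("mn"),
        "operators": texts("mo"),
    }


terms = index_terms(SAMPLE)
print(terms)
```

No math-specific parser is involved: the same generic XML parser and path queries used for any other document are enough to pull out the tokens a search index would want.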
Dude, you're delusional. The web only succeeded because it was built on declarative tech, and JS is the opposite of that.
Don't mix the WS-* trash with XPath/XSLT -- still the only standard data transformation technology. Last I checked, the JSON folks were still trying to reinvent XML Schema? The JSON stack has nothing on XML in terms of maturity and features.