A lightweight and focused scraper that generates detailed HTML validity reports for web pages. It helps developers and SEO teams quickly identify markup issues and standards compliance gaps using reliable validation logic. Ideal for anyone who cares about clean, future-proof HTML.
Created by Bitbash, built to showcase our approach to Scraping and Automation!
If you are looking for w3c-html-reporter, you've just found your team — Let’s Chat. 👆👆
This project analyzes web pages and produces structured reports describing how well their HTML complies with official standards. It solves the problem of manually checking markup validity and interpreting raw validator feedback. The scraper is built for developers, QA engineers, and site owners who want clear, actionable insights.
- Detects errors and warnings that can affect rendering and accessibility
- Helps maintain cross-browser compatibility
- Improves long-term maintainability of web projects
- Supports SEO and technical audits with concrete data
| Feature | Description |
|---|---|
| URL-based validation | Analyze one or multiple web pages by URL. |
| Detailed messages | Captures info, warnings, and errors with precise locations. |
| Language awareness | Preserves language context reported by the validator. |
| Structured output | Produces clean, machine-readable JSON results. |
| Debug mode | Enables verbose logging for troubleshooting and analysis. |
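
A run is typically configured with a short JSON input listing the pages to validate. The field names below (`urls`, `debug`) are illustrative assumptions only; check `data/sample-input.json` for the actual schema.

```json
{
  "urls": [
    "https://example.com",
    "https://example.org/pricing"
  ],
  "debug": false
}
```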
| Field Name | Field Description |
|---|---|
| url | The validated webpage URL. |
| language | Language detected for the page or message. |
| severity | Message level such as info, warning, or error. |
| message | Human-readable explanation of the validation issue. |
| firstLine | Line number where the issue starts. |
| lastLine | Line number where the issue ends. |
| firstColumn | Column position of the issue start. |
| lastColumn | Column position of the issue end. |
| markup | HTML snippet related to the issue. |
| highlightIndex | Offset within the markup snippet where the highlighted portion begins. |
| highlightLength | Length of the highlighted portion of the markup snippet. |
```json
[
  {
    "url": "https://apify.com",
    "language": "en",
    "severity": "info",
    "lastLine": 10,
    "firstColumn": 301,
    "lastColumn": 357,
    "message": "Trailing slash on void elements has no effect and interacts badly with unquoted attribute values.",
    "markup": "rowser.\"/><meta name=\"twitter:card\" content=\"summary_large_image\"/><meta ",
    "highlightIndex": 10,
    "highlightLength": 57
  }
]
```
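
As a quick illustration of how the fields fit together, the sketch below reads a report shaped like the example above and prints the location and highlighted markup of each message. The report path is an assumption for this example.

```js
// report-summary.js — minimal sketch for reading a report file (path is assumed)
const fs = require('fs');

const messages = JSON.parse(fs.readFileSync('data/sample-output.json', 'utf8'));

for (const msg of messages) {
  // firstLine can be absent when the issue starts and ends on the same line
  const line = msg.firstLine ?? msg.lastLine;
  console.log(`[${msg.severity}] ${msg.url}:${line}:${msg.firstColumn} ${msg.message}`);

  // highlightIndex/highlightLength point into the markup snippet
  console.log(`    ${msg.markup.slice(msg.highlightIndex, msg.highlightIndex + msg.highlightLength)}`);
}
```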
```
W3C Html Reporter/
├── src/
│   ├── index.js
│   ├── validator/
│   │   ├── htmlValidator.js
│   │   └── messageParser.js
│   ├── config/
│   │   └── defaultConfig.json
│   └── utils/
│       └── logger.js
├── data/
│   ├── sample-input.json
│   └── sample-output.json
├── package.json
└── README.md
```
- Frontend developers use it to validate pages early, so they can ship cleaner and more stable HTML.
- SEO specialists run it during audits to uncover markup issues that may affect indexing.
- QA teams integrate it into automated checks to ensure standards compliance before release.
- Agencies apply it across client sites to standardize technical quality reviews.
Does this scraper validate JavaScript-rendered content? It validates the HTML as served at the time of request. If content is rendered client-side, ensure the final HTML is accessible to the validator.
Can I validate multiple URLs in one run? Yes, the scraper accepts a list of URLs and processes each independently.
What types of issues are reported? The output includes informational notes, warnings, and errors exactly as classified by the validator.
Is this suitable for CI pipelines? Yes, the structured JSON output makes it easy to integrate into automated quality checks.
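
For example, a small gate script (sketched below, with an assumed report path) can fail the pipeline whenever the report contains errors:

```js
// ci-gate.js — sketch of a CI step that fails the build on validation errors
const fs = require('fs');

const report = JSON.parse(fs.readFileSync('data/sample-output.json', 'utf8'));
const errorCount = report.filter((msg) => msg.severity === 'error').length;

if (errorCount > 0) {
  console.error(`HTML validation failed with ${errorCount} error(s).`);
  process.exit(1); // non-zero exit marks the pipeline step as failed
}
console.log('HTML validation passed.');
```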
Primary Metric: Validates a typical page in under 2 seconds.
Reliability Metric: Maintains a success rate above 98% across diverse websites.
Efficiency Metric: Handles dozens of URLs per minute with minimal memory overhead.
Quality Metric: Reports all validator messages with full positional accuracy and context.
