Quick way to find a value in HTML (Java)

Tags: regex html java
By : pek
Source: Stackoverflow.com
Question!

Using regex, what is the simplest way to fetch a website's HTML and find the value inside this tag (or any attribute's value, for that matter)?

<html>
  <head>
  [snip]
  <meta name="generator" value="thevalue i'm looking for" />
  [snip]
By : pek


Answers

You should be using an XPath query. It's as simple as getting the value of "/html/head/meta[@name='generator']/@value".
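For illustration, here is a minimal sketch of that query using the JDK's built-in javax.xml.xpath API. It assumes the page is well-formed XML (XHTML, for example); a strict XML parser will reject sloppy HTML. The class name MetaXPath, the method generatorValue and the sample markup are made up for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class MetaXPath {
    // Hypothetical helper: returns the value attribute of <meta name="generator">,
    // or an empty string when no such element exists.
    // Assumption: the input is well-formed XML; otherwise parsing fails.
    public static String generatorValue(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Note the quotes around 'generator': without them XPath would
            // compare @name against a child element called generator.
            return xpath.evaluate("/html/head/meta[@name='generator']/@value", doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head><title>t</title>"
                + "<meta name=\"generator\" value=\"thevalue i'm looking for\" />"
                + "</head><body></body></html>";
        System.out.println(generatorValue(page)); // prints: thevalue i'm looking for
    }
}
```

If the page comes from the network rather than a string, the same DocumentBuilder can parse an InputStream instead.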

A good tutorial: Parsing an XML Document with XPath

By : Vardhan


It's amazing how no one, when addressing the problem of using regex with HTML, confronts the fact that HTML is often NOT well-formed, which renders a lot of HTML parsers completely useless.

If you are developing tools to analyze web pages, and it's a fact that these pages are not well-formed HTML, then the statement "Regex should never be used to parse HTML" or "use an HTML parser" is just completely bogus. The fact is that in the real world, people write HTML however they feel like, and not necessarily in a form suited to parsers.

Regex is a completely valid way to find elements in text, and thus in HTML. If there is any other reasonable way to address the Original Poster's problem, then post it instead of falling back on a "use a parser" or "RTFM" statement.



It depends.

If you are extracting information from a site or sites that are guaranteed to be well-formed HTML, and you know that the <meta> won't be obfuscated in some way, then reading the <head> section line by line and applying a regex is a good approach.
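A rough sketch of that line-by-line approach, using only java.util.regex. The class name MetaRegex, the method generatorValue and the pattern itself are illustrative assumptions, not a general solution: the pattern expects the whole <meta> tag on one line, name before value, and a double-quoted value.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaRegex {
    // Assumed tag shape: <meta ... name="generator" ... value="..."> on a single line.
    private static final Pattern GENERATOR = Pattern.compile(
            "<meta\\s+[^>]*?name\\s*=\\s*[\"']generator[\"'][^>]*?value\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: scans the markup line by line and stops once
    // the closing </head> tag has been seen.
    public static Optional<String> generatorValue(String html) {
        for (String line : html.split("\\R")) {
            Matcher m = GENERATOR.matcher(line);
            if (m.find()) {
                return Optional.of(m.group(1));
            }
            if (line.toLowerCase(Locale.ROOT).contains("</head>")) {
                break; // past the <head> section, give up
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String page = "<html>\n  <head>\n"
                + "  <meta name=\"generator\" value=\"thevalue i'm looking for\" />\n"
                + "  </head>\n</html>";
        System.out.println(generatorValue(page).orElse("not found"));
        // prints: thevalue i'm looking for
    }
}
```

When reading from a live connection, the same loop works with BufferedReader.readLine(), so the download can stop as soon as </head> has gone by.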

On the other hand, if the HTML may be mangled or "tricky" then you need to use a proper HTML parser, possibly a permissive one like HTMLTidy. Beware of using a strict HTML or XML parser on stuff trawled from random websites. Lots of so-called HTML you find out there is actually malformed.

By : Stephen C

