Jsoup (opens new window) is a library allowing us to
- parse HTML from an URL, file or string
- Extract data from the DOM, i.e. by using CSS selectors
- Manipulate HTML elements, attributes and texts
- Generate clean HTML
In this Jsoup tutorial, I am going to show you how to parse a website and extract data.
First off, you need to include the jsoup dependency. For news, changelog and bug reports, check out the official Jsoup page (opens new window).
<dependency>
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.10.2</version>
</dependency>
# Parsing a document
Start by parsing HTML into a Document. Henceforth, we are working with that Document. Jsoup is called statically. Jsoup offers the following variants of parsing HTML.
Jsoup - Methoden zum Parsen eines Dokumentes
To parse a website to a document, you can simply call the following method
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
public class WebsiteParser {
private static final int TIMEOUT\_IN\_MS = 5000;
public static void main(String\[\] args) throws MalformedURLException, IOException
{
Document doc = Jsoup.parse(new URL("url"), TIMEOUT\_IN\_MS);
System.out.println(doc.html());
}
}
# Extract Data
Once we parsed the HTML to a document, we can start extracting data. Jsoup offers quite a few ways to do so, for example...
- getElementById - Returns a HTML element (org.jsoup.nodes.Element) by the given ID
- getElementsByTag - Returns a list of HTML elements (org.jsoup.select.Elements) with the given tag
- getElementsByClass - Returns a list of HTML elements (org.jsoup.select.Elements) with the given CSS class
- getElementsByAttribute - Returns a list of HTML elements (org.jsoup.select.Elements) containing the given attribute
- select - Returns a list of HTML elements (org.jsoup.select.Elements) matching the given CSS selector
You can also apply further selectors on org.jsoup.select.Elements, making the following example valid
Elements tables = doc.getElementsByAttribute("table");
for(Element table : tables)
{
Elements tableRows = table.getElementsByAttribute("tr");
for(Element tableRow : tableRows)
{
Elements tableData = tableRow.getElementsByTag("td");
// ...
}
}
# CSS Selector
As mentioned previously, the select method makes it possible to select HTML elements by providing a CSS selector.
Elements links = doc.select("a[href]"); // Get all links with a href attribute present
Elements gifs = doc.select("img[src$=.gif]"); // Get all images that are gifs
Element tableRows = doc.select("table.versions > tbody > tr").first(); // Get all table rows "tr" within a tbody element, in a table element with the class versions
Elements resultLinks = doc.select("span > a"); // Get all links within a span element
Elements lastTableChild = doc.select("table > tbody > tr:last-child"); // Get the last child of a table row within a table body within a tables
# Overview of selectors
tagname
: Finds the element with the given tag, i.e. a or divns|tag
: Finds the elements from a given tag in a specific namespace, fb|name matches fb:name elements#id
: Finds elements by the given ID, i.e. #eow-title.class
: Finds elements with the given class, i.e. .versionsfinds all elements with the versions class[attribute]
: Finds all elements with the given attribute, i.e. [href] or [name][^attr]
: Finds all ements with the given prefix as attribute name, i.e. [^data-]findet alle elements with HTML5 data-attributes
[attr=value]
: Finds all elements with the given attribute and value, i.e. [width=500][attr^=value]
,[attr$=value]
,[attr*=value]
: Finds all elements, that start/end/contain the attributes value, i.e. [href*=/path/]finds all elements, whose href attributes contain /path/ in it's value
[attr~=regex]
: Finds all elements, matchign the given regex, i.e.img[src~=(?i)\.(png|jpe?g)]
*
: Finds all elements
# Combining selectors
el#id
: Elements with the ID, i.e. div#logoel.class
: Elements with the class, i.e.div.masthead
el[attr]
: Elements with the given attribute, i.e.a[href]
- Concatted, i.e.
a[href].highlight
ancestor child
: Child elements, that have a specific parent, i.e..body p
finds allp
elements, that are children (not just directly), of an element with a body classparent > child
: Finds the direct child element, i.e.div.content > pfinds all p elements, that are direct children of a div with the class content
el, el, el
: Group multiple selectors, i.e.div.masthead, div.logo
List of all selectors on Jsoup.org (opens new window)
# Real-World example - Parsing dependency versions from mvnrepository.com
Theory is one thing, but let us see Jsoup in action with a real-world example.
Let's try parsing the dependency versions from a library, let's take Async-Http-Client from com.ning (opens new window).
This is how the source code of the page looks like:
<table class="grid versions" width="100%">
<thead>
<tr>
....
</tr>
</thead>
<tbody>
<tr>
<td rowspan="56">
<div><span><b>1.9</b>.x</span></div>
</td>
<td><a href="/artifact/com.ning/async-http-client/1.9.30" class="vbtn release">1.9.30</a></td>
<td>
<div><a href="/artifact/com.ning/async-http-client/1.9.30/usages">2</a><span class="rankingbar" style="width: 7px; "></span></div>
</td>
<td>release</td>
<td> (Jul, 2015) </td>
</tr>
...
</tbody>
</table>
We want to parse the exact version (1.9.30), the type (release) - could be release, beta, general availability or something else and the usages (2).
Let's create a POJO representing a dependency.
public class MavenDependency {
private String group;
private String artifact;
private String version;
private int usages;
private String type;
// Konstruktor
// Getter & Setter
// ToString
}
Start by parsing the website
// Parse website with a 5 secodn timeout
Document document = Jsoup.parse(new URL("http://mvnrepository.com/artifact/com.ning/async-http-client"), 5000);
Using the css selector table.versions > tbody > tr, we get all rows containing the versions. It is important to select the tbody, else we wil also get unnecessary information from the tables's head.
Elements versionRows = document.select("table.versions > tbody > tr");
Now that we have the rows, we continue extracting the exact information for a single dependency entry by applying CSS selectors.
- Version: td > a.vbtn - Link with the class vbtn in a td element
- Usages: td:nth-last-child(3) - The third last child from the td element
- Type: td:nth-last-child(2) - The second last child from the td element
There are probably a bunch of other ways to access this data. This is how the entire code looks like
public class MavenParser {
private static final String MVN\_REPO\_BASE\_SEARCH\_URL = "http://mvnrepository.com/artifact/";
public static void main(String\[\] args) throws MalformedURLException, IOException {
List<MavenDependency> availableDependencies = getMavenResults("com.ning", "async-http-client"); // -> http://mvnrepository.com/artifact/com.ning/async-http-client
availableDependencies.forEach(System.out::println);
}
public static List<MavenDependency> getMavenResults(String group, String artifact)
throws MalformedURLException, IOException {
String url = MVN\_REPO\_BASE\_SEARCH\_URL + group + "/" + artifact;
// Parse website with 5 seconds timeout
Document document = Jsoup.parse(new URL(url), 5000);
// All versions as table rows
Elements versionRows = document.select("table.versions > tbody > tr");
List<MavenDependency> mvnResults = new ArrayList<>();
// Parse a single table row
for (Element tr : versionRows) {
String version = tr.select("td > a.vbtn").get(0).text();
int usages = Integer.parseInt(tr.select("td:nth-last-child(3)").get(0).text());
String type = tr.select("td:nth-last-child(2)").get(0).text();
mvnResults.add(new MavenDependency(group, artifact, version, usages, type));
}
return mvnResults;
}
}
If you like this post, feel free to follow me or hit me up on Twitter (opens new window).