Introduction to jsoup | iBit程序猿

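jsoup is a Java library for working with real-world HTML. It provides a very convenient API for fetching URLs and for extracting and manipulating data, using the best of HTML5 DOM methods and CSS selectors.

jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.

Scrape and parse HTML from a URL, file, or string
Find and extract data, using DOM traversal or CSS selectors
Manipulate the HTML elements, attributes, and text
Clean user-submitted content against a safe whitelist, to prevent XSS attacks
Output tidy HTML

jsoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating to invalid tag soup; in every case jsoup will create a sensible parse tree.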

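Example

Fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the "In the news" section into a list of Elements (online sample, full source):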

Document doc = Jsoup.connect("https://en.wikipedia.org/").get();
log(doc.title());
Elements newsHeadlines = doc.select("#mp-itn b a");
for (Element headline : newsHeadlines) {
  log("%s\n\t%s", 
    headline.attr("title"), headline.absUrl("href"));
}

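Open source

jsoup is an open source project distributed under the liberal MIT license. The source code is available on GitHub.

Download and install jsoup

jsoup is available as a downloadable .jar Java library. The current release version is 1.13.1.

jsoup-1.13.1.jar core library
jsoup-1.13.1-sources.jar optional sources jar
jsoup-1.13.1-javadoc.jar optional javadoc jar

What's new

See the 1.13.1 release announcement for the latest changes, or the changelog for the full history.

Earlier versions of jsoup are also available.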

Maven

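If you use Maven to manage the dependencies of your Java project, you do not need to download anything; just place the following into the <dependencies> section of your POM: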

<dependency>
  <!-- jsoup HTML parser library @ https://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.13.1</version>
</dependency>

Gradle

// jsoup HTML parser library @ https://jsoup.org/
// the older 'compile' configuration was removed in Gradle 7; 'implementation' replaces it
implementation 'org.jsoup:jsoup:1.13.1'

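Building from source

If you want to try out unreleased changes, or make changes of your own, you need to build a jar from source. This is straightforward. It is best to use git, so that you can stay up to date and contribute your changes back: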

git clone https://github.com/jhy/jsoup.git
cd jsoup
mvn install

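This runs the unit and integration tests and, if they pass, installs a snapshot jar into your local Maven repository.

If you would rather not use git, you can download a zip file instead: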

curl -Lo jsoup.zip https://github.com/jhy/jsoup/archive/master.zip
unzip jsoup.zip
cd jsoup-master
mvn install

Dependencies

jsoup is entirely self-contained and has no dependencies.

jsoup runs on Java 7 and later, Scala, Kotlin, Android, OSGi, Lambda, and Google App Engine.

Cookbook contents

Introduction

Parsing an HTML document

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

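See "Parse a document from a String" for more information.

The parser makes every attempt to create a clean parse from the HTML you provide, whether or not the HTML is well-formed. It handles:

unclosed tags (e.g. "<p>Lorem <p>Ipsum" parses to "<p>Lorem</p> <p>Ipsum</p>")
implicit tags (e.g. a bare "<td>Table data</td>" is wrapped into a "<table><tr><td>...")
reliable creation of the document structure (html containing a head and a body, with only appropriate elements inside the head)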
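The recovery behavior listed above can be checked with a small program. This is a minimal sketch, assuming jsoup 1.13.1 is on the classpath; the class name LenientParseSketch and its helper methods are illustrative and not part of jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LenientParseSketch {
    // Counts the <p> elements the parser builds from possibly malformed HTML.
    public static int paragraphCount(String html) {
        return Jsoup.parse(html).select("p").size();
    }

    // Returns the text of the <title> found inside <head>, even when the
    // input never declared <html>, <head>, or <body> explicitly.
    public static String headTitle(String html) {
        Document doc = Jsoup.parse(html);
        return doc.head().select("title").text();
    }

    public static void main(String[] args) {
        // Two unclosed <p> tags still yield two separate paragraph elements.
        System.out.println(paragraphCount("<p>Lorem <p>Ipsum"));
        // The missing document shell is created and <title> lands in <head>.
        System.out.println(headTitle("<title>Hi</title><p>Body text"));
    }
}
```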

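The object model of a document

Documents consist of Elements and TextNodes (and a few other node types: see the nodes package tree).
The inheritance chain is: Document extends Element extends Node. TextNode extends Node.
An Element contains a list of child nodes and has one parent Element. It also provides a filtered list of child Elements only.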
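The difference between the full child node list and the filtered list of child Elements is easiest to see on a small fragment. The following sketch assumes jsoup 1.13.1; NodeModelSketch and its methods are hypothetical names used only for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class NodeModelSketch {
    private static Element firstDiv(String html) {
        return Jsoup.parse(html).select("div").first();
    }

    // childNodes() includes text nodes as well as elements.
    public static int childNodeCount(String html) {
        return firstDiv(html).childNodes().size();
    }

    // children() is the filtered view: elements only.
    public static int childElementCount(String html) {
        return firstDiv(html).children().size();
    }

    // Lists each child as "text" or as its tag name, in document order.
    public static String describeChildren(String html) {
        StringBuilder sb = new StringBuilder();
        for (Node node : firstDiv(html).childNodes()) {
            if (node instanceof TextNode) {
                sb.append("text;");
            } else if (node instanceof Element) {
                sb.append(((Element) node).tagName()).append(';');
            } else {
                sb.append("other;");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<div>Hello <b>world</b>!</div>";
        System.out.println(describeChildren(html));
    }
}
```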

Input

Parse a document from a String

Use the static Jsoup.parse(String html) method, or Jsoup.parse(String html, String baseUri) if the page came from the web and you want to resolve absolute URLs.

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

Parse a body fragment

Use the Jsoup.parseBodyFragment(String html) method.

String html = "<div><p>Lorem ipsum.</p>";
Document doc = Jsoup.parseBodyFragment(html);
Element body = doc.body();

Load a document from a URL

Use the Jsoup.connect(String url) method.

Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();

Set additional request parameters:

Document doc = Jsoup.connect("http://example.com")
  .data("query", "Java")
  .userAgent("Mozilla")
  .cookie("auth", "token")
  .timeout(3000)
  .post();
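Because Jsoup.connect only builds a request, the configuration can be inspected before anything is sent over the network. The sketch below assumes jsoup 1.13.1; ConnectionConfigSketch and its methods are illustrative names, and the URL and parameter values are the placeholder values from the example above.

```java
import org.jsoup.Connection;
import org.jsoup.Jsoup;

public class ConnectionConfigSketch {
    // Builds the request without sending it; call execute() on the result to send.
    public static Connection searchRequest(String url, String query) {
        return Jsoup.connect(url)
            .data("query", query)
            .userAgent("Mozilla")
            .cookie("auth", "token")
            .timeout(3000) // milliseconds
            .method(Connection.Method.POST);
    }

    public static int timeoutMillis() {
        return searchRequest("http://example.com", "Java").request().timeout();
    }

    public static String methodName() {
        return searchRequest("http://example.com", "Java").request().method().name();
    }

    public static String authCookie() {
        return searchRequest("http://example.com", "Java").request().cookie("auth");
    }

    public static String userAgent() {
        return searchRequest("http://example.com", "Java").request().header("User-Agent");
    }

    public static void main(String[] args) {
        System.out.println(methodName() + " with timeout " + timeoutMillis() + " ms");
    }
}
```

Keeping the request construction in one method makes it possible to reuse the same settings for several calls; sending the request still requires execute(), get(), or post(), each of which can throw an IOException.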

Load a document from a file

Use the static Jsoup.parse(File in, String charsetName, String baseUri) method.

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
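To try file parsing without preparing /tmp/input.html by hand, a temporary file can stand in for it. This is a sketch under stated assumptions: jsoup 1.13.1 on the classpath, Java 8 or later for UncheckedIOException, and a hypothetical class name FileParseSketch. The base URI is what lets the relative link resolve to an absolute URL.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileParseSketch {
    // Writes a small page to a temp file, parses it, and returns
    // "<title> -> <absolute URL of the first link>".
    public static String titleAndFirstLink() {
        Path path = null;
        try {
            path = Files.createTempFile("jsoup-input", ".html");
            String html = "<html><head><title>From file</title></head>"
                + "<body><a href='docs/index.html'>Docs</a></body></html>";
            Files.write(path, html.getBytes(StandardCharsets.UTF_8));

            Document doc = Jsoup.parse(path.toFile(), "UTF-8", "http://example.com/");
            return doc.title() + " -> " + doc.select("a").first().absUrl("href");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (path != null) {
                path.toFile().delete(); // best-effort cleanup of the temp file
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(titleAndFirstLink());
    }
}
```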

Extracting data

Use DOM methods to navigate a document

After parsing HTML into a Document, use the DOM-like methods.

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
  String linkHref = link.attr("href");
  String linkText = link.text();
}

Finding elements

getElementById(String id)
getElementsByTag(String tag)
getElementsByClass(String className)
getElementsByAttribute(String key) (and related methods)
Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(); nextElementSibling(), previousElementSibling()
Graph: parent(), children(), child(int index)

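Working with HTML and text

attr(String key) to get and attr(String key, String value) to set attributes
attributes() to get all attributes
id(), className() and classNames()
text() to get and text(String value) to set the text content
html() to get and html(String value) to set the inner HTML content
outerHtml() to get the outer HTML value
data() to get data content (e.g. of script and style tags)
tag() and tagName()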
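Several of the finder and accessor methods listed above can be combined on an in-memory document. A minimal sketch, assuming jsoup 1.13.1; DomNavigationSketch, its HTML constant, and its helper methods are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class DomNavigationSketch {
    static final String HTML =
        "<div id='content'>"
      + "<h1 class='title'>News</h1>"
      + "<p>First</p>"
      + "<p>Second <a href='/more'>more</a></p>"
      + "</div>";

    private static Element content() {
        return Jsoup.parse(HTML).getElementById("content");
    }

    // getElementsByClass + tagName()
    public static String headingTag() {
        return content().getElementsByClass("title").first().tagName();
    }

    // nextElementSibling() skips text nodes and returns the next element.
    public static String textAfterHeading() {
        return content().getElementsByClass("title").first().nextElementSibling().text();
    }

    // children() counts child elements only.
    public static int childCount() {
        return content().children().size();
    }

    // child(int) is zero-based; attr() reads the raw attribute value.
    public static String linkHref() {
        return content().child(2).getElementsByTag("a").first().attr("href");
    }

    // parent() walks back up; id() reads the id attribute.
    public static String parentIdOfHeading() {
        return content().child(0).parent().id();
    }

    public static void main(String[] args) {
        System.out.println(headingTag() + ": " + textAfterHeading());
    }
}
```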

Element data

Use selector syntax to find elements

Use the Element.select(String selector) and Elements.select(String selector) methods:

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Elements links = doc.select("a[href]"); // a with href
Elements pngs = doc.select("img[src$=.png]");
  // img with src ending .png

Element masthead = doc.select("div.masthead").first();
  // div with class=masthead

Elements resultLinks = doc.select("h3.r > a"); // direct a after h3

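jsoup elements support a CSS-like (or jQuery-like) selector syntax for finding matching elements, which allows very powerful and robust queries.

Selector overview

tagname: find elements by tag, e.g. "a"
ns|tag: find elements by tag in a namespace, e.g. "fb|name" finds "fb:name" elements
#id: find elements by ID, e.g. "#logo"
.class: find elements by class name, e.g. ".masthead"
[attribute]: elements with the attribute, e.g. "[href]"
[^attr]: elements with an attribute name prefix, e.g. "[^data-]" finds elements with HTML5 dataset attributes
[attr=value]: elements with the attribute value, e.g. "[width=500]" (the value may also be quoted, e.g. "[data-name='launch sequence']")
[attr^=value], [attr$=value], [attr*=value]: elements with attributes whose value starts with, ends with, or contains the value, e.g. "[href*=/path/]"
[attr~=regex]: elements with attribute values that match the regular expression, e.g. "img[src~=(?i)\.(png|jpe?g)]"

Selector combinations

el#id: elements with an ID, e.g. "div#logo"
el.class: elements with a class, e.g. "div.masthead"
el[attr]: elements with an attribute, e.g. "a[href]"
Any combination, e.g. "a[href].highlight"
ancestor child: child elements that descend from the ancestor, e.g. ".body p" finds p elements anywhere under a block with class "body"
parent > child: child elements that descend directly from the parent, e.g. "div.content > p" finds p elements that are direct children of div.content; "body > *" finds all direct children of the body tag
siblingA + siblingB: finds sibling B elements immediately preceded by sibling A, e.g. "div.head + div"
siblingA ~ siblingX: finds sibling X elements preceded by sibling A, e.g. "h1 ~ p"
el, el, el: group multiple selectors and find the unique elements that match any of them, e.g. "div.masthead, div.logo"

Pseudo selectors

:lt(n): find elements whose sibling index (i.e. their position in the DOM tree relative to their parent) is less than n, e.g. "td:lt(3)"
:gt(n): find elements whose sibling index is greater than n, e.g. "div p:gt(2)"
:eq(n): find elements whose sibling index is equal to n, e.g. "form input:eq(1)"
:has(selector): find elements that contain elements matching the selector, e.g. "div:has(p)"
:not(selector): find elements that do not match the selector, e.g. "div:not(.logo)"
:contains(text): find elements that contain the given text. The search is case-insensitive, e.g. "p:contains(jsoup)"
:containsOwn(text): find elements that directly contain the given text
:matches(regex): find elements whose text matches the specified regular expression, e.g. "div:matches((?i)login)"
:matchesOwn(regex): find elements whose own text matches the specified regular expression
Note that the indexed pseudo selectors above are 0-based: the first element is at index 0, the second at index 1, and so on.

For the full list of supported selectors and further details, see the Selector API reference.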
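To see how a few of these selectors behave, they can be run against a small in-memory document. This is a minimal sketch, assuming jsoup 1.13.1; SelectorSketch, its HTML constant, and the count helper are hypothetical. Note that the backslash in the regex selector has to be doubled inside a Java string literal.

```java
import org.jsoup.Jsoup;

public class SelectorSketch {
    static final String HTML =
        "<div class='masthead'><img src='a.png'><img src='b.JPG'><img src='c.gif'></div>"
      + "<div class='content'>"
      + "<h1>Title</h1>"
      + "<p>One</p>"
      + "<p class='highlight'>Two <a href='/path/x'>link</a></p>"
      + "<span>aside</span>"
      + "<p data-id='7'>Three, about jsoup</p>"
      + "</div>";

    // Number of elements in the sample document that match the selector.
    public static int count(String selector) {
        return Jsoup.parse(HTML).select(selector).size();
    }

    public static void main(String[] args) {
        String[] selectors = {
            "img[src$=.png]",               // attribute value ends with
            "img[src~=(?i)\\.(png|jpe?g)]", // attribute value matches regex
            "div.content > p",              // direct children
            "h1 + p",                       // immediately following sibling
            "h1 ~ p",                       // any following sibling
            "p:not(.highlight)",            // negation
            "div:has(a)",                   // contains a matching descendant
            "div.content > p:lt(2)"         // sibling index below 2
        };
        for (String selector : selectors) {
            System.out.println(selector + " -> " + count(selector));
        }
    }
}
```

The last selector illustrates that the index counts every element sibling, not only the p elements: the h1 sits at index 0, so only the first p (index 1) qualifies.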

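Extract attributes, text, and HTML from elements

To get the value of an attribute, use the Node.attr(String key) method
For the text of an element (and its combined children), use Element.text()
For HTML, use Element.html() or Node.outerHtml() as appropriate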

String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();

String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"

String linkOuterH = link.outerHtml(); 
    // "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html(); // "<b>example</b>"

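Other methods

Working with URLs

Make sure you specify a base URI when parsing the document (this is implicit when loading from a URL)
Use the abs: attribute prefix to resolve an absolute URL from an attribute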

Document doc = Jsoup.connect("http://jsoup.org").get();

Element link = doc.select("a").first();
String relHref = link.attr("href");  // == "/"
String absHref = link.attr("abs:href");  // "http://jsoup.org/"
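The same resolution works without a network connection when the base URI is passed to Jsoup.parse explicitly. A small sketch, assuming jsoup 1.13.1; AbsoluteUrlSketch and its methods are illustrative names, and the link markup is invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class AbsoluteUrlSketch {
    private static Element firstLink(String html, String baseUri) {
        return Jsoup.parse(html, baseUri).select("a").first();
    }

    // Resolves the href through the "abs:" attribute prefix.
    public static String viaPrefix(String html, String baseUri) {
        return firstLink(html, baseUri).attr("abs:href");
    }

    // Node.absUrl(String) performs the same resolution.
    public static String viaAbsUrl(String html, String baseUri) {
        return firstLink(html, baseUri).absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<a href='/download'>Get</a>";
        System.out.println(viaPrefix(html, "http://jsoup.org/"));
        // Without a base URI a relative link cannot be resolved,
        // and an empty string is returned instead.
        System.out.println(viaAbsUrl(html, "").isEmpty());
    }
}
```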

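Example program: list links

This example program demonstrates how to fetch a page from a URL; extract links, images, and other pointers; and examine their URLs and text.

Specify the URL to fetch as the program's sole argument.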

package org.jsoup.examples;

import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * Example program to list links from a URL.
 */
public class ListLinks {
    public static void main(String[] args) throws IOException {
        Validate.isTrue(args.length == 1, "usage: supply url to fetch");
        String url = args[0];
        print("Fetching %s...", url);

        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");
        Elements media = doc.select("[src]");
        Elements imports = doc.select("link[href]");

        print("\nMedia: (%d)", media.size());
        for (Element src : media) {
            if (src.normalName().equals("img"))
                print(" * %s: <%s> %sx%s (%s)",
                        src.tagName(), src.attr("abs:src"), src.attr("width"), src.attr("height"),
                        trim(src.attr("alt"), 20));
            else
                print(" * %s: <%s>", src.tagName(), src.attr("abs:src"));
        }

        print("\nImports: (%d)", imports.size());
        for (Element link : imports) {
            print(" * %s <%s> (%s)", link.tagName(),link.attr("abs:href"), link.attr("rel"));
        }

        print("\nLinks: (%d)", links.size());
        for (Element link : links) {
            print(" * a: <%s>  (%s)", link.attr("abs:href"), trim(link.text(), 35));
        }
    }

    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width-1) + ".";
        else
            return s;
    }
}

Example output (trimmed)

Fetching http://news.ycombinator.com/...

Media: (38)
 * img: <http://ycombinator.com/images/y18.gif> 18x18 ()
 * img: <http://ycombinator.com/images/s.gif> 10x1 ()
 * img: <http://ycombinator.com/images/grayarrow.gif> x ()
 * img: <http://ycombinator.com/images/s.gif> 0x10 ()
 * script: <http://www.co2stats.com/propres.php?s=1138>
 * img: <http://ycombinator.com/images/s.gif> 15x1 ()
 * img: <http://ycombinator.com/images/hnsearch.png> x ()
 * img: <http://ycombinator.com/images/s.gif> 25x1 ()
 * img: <http://mixpanel.com/site_media/images/mixpanel_partner_logo_borderless.gif> x (Analytics by Mixpan.)
 
Imports: (2)
 * link <http://ycombinator.com/news.css> (stylesheet)
 * link <http://ycombinator.com/favicon.ico> (shortcut icon)
 
Links: (141)
 * a: <http://ycombinator.com>  ()
 * a: <http://news.ycombinator.com/news>  (Hacker News)
 * a: <http://news.ycombinator.com/newest>  (new)
 * a: <http://news.ycombinator.com/newcomments>  (comments)
 * a: <http://news.ycombinator.com/leaders>  (leaders)
 * a: <http://news.ycombinator.com/jobs>  (jobs)
 * a: <http://news.ycombinator.com/submit>  (submit)
 * a: <http://news.ycombinator.com/x?fnid=JKhQjfU7gW>  (login)
 * a: <http://news.ycombinator.com/vote?for=1094578&dir=up&whence=%6e%65%77%73>  ()
 * a: <http://www.readwriteweb.com/archives/facebook_gets_faster_debuts_homegrown_php_compiler.php?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+readwriteweb+%28ReadWriteWeb%29&utm_content=Twitter>  (Facebook speeds up PHP)
 * a: <http://news.ycombinator.com/user?id=mcxx>  (mcxx)
 * a: <http://news.ycombinator.com/item?id=1094578>  (9 comments)
 * a: <http://news.ycombinator.com/vote?for=1094649&dir=up&whence=%6e%65%77%73>  ()
 * a: <http://groups.google.com/group/django-developers/msg/a65fbbc8effcd914>  ("Tough. Django produces XHTML.")
 * a: <http://news.ycombinator.com/user?id=andybak>  (andybak)
 * a: <http://news.ycombinator.com/item?id=1094649>  (3 comments)
 * a: <http://news.ycombinator.com/vote?for=1093927&dir=up&whence=%6e%65%77%73>  ()
 * a: <http://news.ycombinator.com/x?fnid=p2sdPLE7Ce>  (More)
 * a: <http://news.ycombinator.com/lists>  (Lists)
 * a: <http://news.ycombinator.com/rss>  (RSS)
 * a: <http://ycombinator.com/bookmarklet.html>  (Bookmarklet)
 * a: <http://ycombinator.com/newsguidelines.html>  (Guidelines)
 * a: <http://ycombinator.com/newsfaq.html>  (FAQ)
 * a: <http://ycombinator.com/newsnews.html>  (News News)
 * a: <http://news.ycombinator.com/item?id=363>  (Feature Requests)
 * a: <http://ycombinator.com>  (Y Combinator)
 * a: <http://ycombinator.com/w2010.html>  (Apply)
 * a: <http://ycombinator.com/lib.html>  (Library)
 * a: <http://www.webmynd.com/html/hackernews.html>  ()
 * a: <http://mixpanel.com/?from=yc>  ()

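Modifying data

Set attribute values

Use the attribute setter methods Element.attr(String key, String value) and Elements.attr(String key, String value).
To modify the class attribute of an element, use the Element.addClass(String className) and Element.removeClass(String className) methods.
The Elements collection has bulk attribute and class methods. For example, to add a rel="nofollow" attribute to every a element inside a div with class comments: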

doc.select("div.comments a").attr("rel", "nofollow");
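The bulk setters return the same Elements collection, so they can be chained. The sketch below assumes jsoup 1.13.1; AttributeEditSketch, the sample markup, and the class names "external" and "old" are invented for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class AttributeEditSketch {
    static final String HTML =
        "<div class='comments'><a href='/a'>A</a><a href='/b' class='old'>B</a></div>"
      + "<a href='/c'>C</a>";

    // Applies the edits to the links inside div.comments only.
    public static Document edited() {
        Document doc = Jsoup.parse(HTML);
        doc.select("div.comments a")
            .attr("rel", "nofollow")
            .addClass("external")
            .removeClass("old");
        return doc;
    }

    // Number of elements matching the selector after the edits.
    public static int count(String selector) {
        return edited().select(selector).size();
    }

    public static void main(String[] args) {
        System.out.println(edited().body().html());
    }
}
```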

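Set the HTML of an element

Use the HTML setter methods of Element:

Element.html(String html) clears any existing inner HTML in an element and replaces it with the parsed HTML.
Element.prepend(String first) and Element.append(String last) add HTML to the start or end of an element's inner HTML, respectively.
Element.wrap(String around) wraps HTML around the outer HTML of an element.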

Element div = doc.select("div").first(); // <div></div>
div.html("<p>lorem ipsum</p>"); // <div><p>lorem ipsum</p></div>
div.prepend("<p>First</p>");
div.append("<p>Last</p>");
// now: <div><p>First</p><p>lorem ipsum</p><p>Last</p></div>

Element span = doc.select("span").first(); // <span>One</span>
span.wrap("<li><a href='http://example.com/'></a></li>");
// now: <li><a href="http://example.com/"><span>One</span></a></li>

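Set the text content of an element

Use the text setter methods of Element:

Element.text(String text) clears any existing inner HTML in an element and replaces it with the supplied text.
Element.prepend(String first) and Element.append(String last) add content to the start or end of an element's inner HTML, respectively. Both parse their argument as HTML, so a plain string becomes a text node; to insert literal text that may contain markup characters, use Element.prependText(String text) and Element.appendText(String text).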

Element div = doc.select("div").first(); // <div></div>
div.text("five > four"); // <div>five &gt; four</div>
div.prepend("First ");
div.append(" Last");
// now: <div>First five &gt; four Last</div>
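The distinction between inserting literal text and inserting parsed HTML matters as soon as the string contains markup characters. A minimal sketch, assuming jsoup 1.13.1; TextVersusHtmlSketch and its helpers are hypothetical names.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class TextVersusHtmlSketch {
    private static Element emptyDiv() {
        return Jsoup.parse("<div></div>").select("div").first();
    }

    // appendText() treats its argument as literal text and escapes it on output.
    public static Element literal() {
        Element div = emptyDiv();
        div.text("five > four");
        div.appendText(" <b>Last</b>");
        return div;
    }

    // append() parses its argument as HTML, so a real <b> element is created.
    public static Element parsed() {
        Element div = emptyDiv();
        div.text("five > four");
        div.append(" <b>Last</b>");
        return div;
    }

    public static int boldCountLiteral() { return literal().select("b").size(); }
    public static int boldCountParsed()  { return parsed().select("b").size(); }
    public static String literalText()   { return literal().text(); }
    public static String literalHtml()   { return literal().html(); }

    public static void main(String[] args) {
        System.out.println(literal().outerHtml());
        System.out.println(parsed().outerHtml());
    }
}
```

For user-supplied strings, the text-based methods are usually the safer default, since nothing in the input is interpreted as markup.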

Sanitize HTML

Use the jsoup HTML Cleaner with a configuration specified by a Whitelist.

String unsafe = 
  "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
// now: <p><a href="http://example.com/" rel="nofollow">Link</a></p>
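Cleaning can be paired with a validity check when the goal is to reject unsafe input rather than silently rewrite it. This sketch assumes jsoup 1.13.1, the release this page describes, where the class is named org.jsoup.safety.Whitelist (it may be named differently in other releases); CleanSketch and its methods are illustrative names.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

public class CleanSketch {
    // Returns the input with every tag and attribute outside the basic whitelist removed.
    public static String sanitize(String untrusted) {
        return Jsoup.clean(untrusted, Whitelist.basic());
    }

    // True only if cleaning would not have to discard anything.
    public static boolean isAlreadySafe(String untrusted) {
        return Jsoup.isValid(untrusted, Whitelist.basic());
    }

    public static void main(String[] args) {
        String unsafe =
            "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
        System.out.println(isAlreadySafe(unsafe)); // the onclick attribute is not allowed
        System.out.println(sanitize(unsafe));
        System.out.println(sanitize("<script>alert(1)</script><b>ok</b>"));
    }
}
```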