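A recent project needed the version information of the open-source components in the Maven repository. I assumed a wget command would be enough to pull it from the Maven Repo. Unfortunately, reality fell short of the plan: since wget could not get the data, I wrote a simple crawler to collect it myself.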
Analysis
Open the page
Open the repository page: https://repo.maven.apache.org/maven2/
Everything on the page is displayed as directories and files.
View the page source
It is easy to see that the directory and file entries are all in a tags under the element with id "contents".
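Check the version information (in maven-metadata.xml)
Drilling down into a directory shows that a component's version information is described in maven-metadata.xml. For example, this is the content of
https://repo.maven.apache.org/maven2/tech/ibit/sql-builder/maven-metadata.xml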
- <?xml version="1.0" encoding="UTF-8"?>
- <metadata>
- <groupId>tech.ibit</groupId>
- <artifactId>sql-builder</artifactId>
- <versioning>
- <latest>2.0</latest>
- <release>2.0</release>
- <versions>
- <version>1.0</version>
- <version>1.1</version>
- <version>2.0</version>
- </versions>
- <lastUpdated>20201130115230</lastUpdated>
- </versioning>
- </metadata>
maven-metadata.xml contains the groupId, artifactId and version information.
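To make these three fields concrete, here is a minimal sketch that reads them from an inline copy of the sample above using the JDK's built-in DOM parser. The class name MetadataFieldsSketch and its helper methods are made up for illustration and are not part of the original code; the inline XML is a shortened copy of the sample, not a live download.

```java
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

import javax.xml.parsers.DocumentBuilderFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class MetadataFieldsSketch {

    /** Parses an XML string with the JDK DOM parser (hypothetical helper). */
    static Document parse(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    /** Returns the trimmed text of every element with the given tag name, in document order. */
    static List<String> values(Document doc, String tagName) {
        List<String> result = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName(tagName);
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened copy of the sample metadata shown above.
        String xml = "<metadata>"
                + "<groupId>tech.ibit</groupId>"
                + "<artifactId>sql-builder</artifactId>"
                + "<versioning><versions>"
                + "<version>1.0</version><version>1.1</version><version>2.0</version>"
                + "</versions></versioning>"
                + "</metadata>";
        Document doc = parse(xml);
        System.out.println("groupId:    " + values(doc, "groupId"));
        System.out.println("artifactId: " + values(doc, "artifactId"));
        System.out.println("versions:   " + values(doc, "version"));
    }
}
```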
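Putting this together, the version information for everything in the Maven repository can be collected in two steps:
- Traverse all directories of the Maven repo and collect the maven-metadata.xml files.
- Parse each maven-metadata.xml to get the groupId, artifactId and version.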
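The traversal step can be sketched without any network access. The example below walks a small in-memory directory listing one level at a time and collects every maven-metadata.xml path. The listing map is a hypothetical stand-in for fetching and parsing a real index page, the class and method names are invented for this sketch, and directories are recognized by a trailing slash, which is a simplification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LevelTraversalSketch {

    static final String METADATA = "maven-metadata.xml";

    /**
     * Visits the tree level by level, starting at root, and returns the
     * paths of all maven-metadata.xml entries that were found.
     */
    static List<String> collectMetadata(Map<String, List<String>> listing, String root) {
        List<String> found = new ArrayList<>();
        List<String> current = new ArrayList<>();
        current.add(root);
        while (!current.isEmpty()) {
            // Directories discovered on this level are visited on the next one.
            List<String> next = new ArrayList<>();
            for (String dir : current) {
                List<String> entries = listing.getOrDefault(dir, Collections.<String>emptyList());
                for (String entry : entries) {
                    String path = dir + entry;
                    if (METADATA.equals(entry)) {
                        found.add(path);
                    } else if (entry.endsWith("/")) {
                        next.add(path);
                    }
                }
            }
            current = next;
        }
        return found;
    }

    public static void main(String[] args) {
        // Hypothetical listing that mirrors the sql-builder example.
        Map<String, List<String>> listing = new HashMap<>();
        listing.put("maven2/", Arrays.asList("tech/"));
        listing.put("maven2/tech/", Arrays.asList("ibit/"));
        listing.put("maven2/tech/ibit/", Arrays.asList("sql-builder/"));
        listing.put("maven2/tech/ibit/sql-builder/", Arrays.asList("1.0/", "maven-metadata.xml"));
        System.out.println(collectMetadata(listing, "maven2/"));
    }
}
```

Keeping one list per level is what makes the crawl resumable: if it stops halfway, only the current level has to be repeated.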
Sample code:
Crawl all maven-metadata.xml files and directories
- package tech.ibit.crawler;
- import org.apache.commons.lang.StringUtils;
- import org.jsoup.Jsoup;
- import org.jsoup.nodes.Document;
- import org.jsoup.nodes.Element;
- import org.jsoup.select.Elements;
- import java.io.File;
- import java.io.FileWriter;
- import java.io.IOException;
- import java.util.Scanner;
- /**
- * Maven crawler
- *
- * @author IBIT程序猿
- */
- public class MavenCrawler {
- /**
- * Root URL to crawl
- */
- private static final String ROOT = "https://repo.maven.apache.org/";
- /**
- * File name of maven-metadata.xml
- */
- private static final String MAVEN_METADATA_XML_FILENAME = "maven-metadata.xml";
- public static void main(String[] args) {
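- // Arguments
- // args[0]: crawl directory (holds the level_x.txt files)
- // args[1]: sleep time between requests, in milliseconds
- // args[2]: level to start from (optional)
- // args[3]: line to start from (optional)
- if (args.length < 2) {
- System.err.println("Arguments: crawlDirectory sleepMillis [startLevel] [startLine]");
- System.exit(1);
- }
- String dirPath = args[0];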
- File dir = new File(dirPath);
- if (!dir.exists() || !dir.isDirectory()) {
- System.err.println("Crawl directory does not exist, dir: " + dirPath);
- System.exit(1);
- }
- int sleepMillis = Integer.parseInt(args[1]);
- int level = 0;
- if (args.length > 2) {
- level = Integer.parseInt(args[2]);
- }
- String beginLine = null;
- if (args.length > 3) {
- beginLine = args[3];
- }
- File urlFile;
- boolean begin = null == beginLine;
- while ((urlFile = getLevelFile(dir, level)).exists()) {
- level++;
- boolean fileEmpty = true;
- File subFile = getLevelFile(dir, level);
- try (Scanner scanner = new Scanner(urlFile);
- FileWriter writer = new FileWriter(subFile)) {
- while (scanner.hasNextLine()) {
- String line = scanner.nextLine();
- if (StringUtils.isNotBlank(line)) {
- fileEmpty = false;
- if (!begin && line.equals(beginLine)) {
- begin = true;
- }
- if (begin) {
- String url = ROOT + line;
- findSubUrl(url, sleepMillis, writer);
- }
- }
- }
- } catch (IOException e) {
- e.printStackTrace();
- }
- if (fileEmpty) {
- urlFile.deleteOnExit();
- subFile.deleteOnExit();
- break;
- }
- }
- }
- /**
- * Gets the URL list file for a level
- *
- * @param dir directory
- * @param level level
- * @return file
- */
- private static File getLevelFile(File dir, int level) {
- return new File(dir.getAbsolutePath() + File.separator + "level_" + level + ".txt");
- }
- /**
- * Finds the child URLs of a URL
- *
- * @param url current URL
- * @param sleepMillis sleep time in milliseconds
- * @param writer writer
- */
- private static void findSubUrl(String url, int sleepMillis, FileWriter writer) {
- try {
- if (url.endsWith(MAVEN_METADATA_XML_FILENAME)) {
- return;
- }
- Thread.sleep(sleepMillis);
- Document doc = Jsoup.connect(url).get();
- Elements links = doc.select("#contents a");
- for (Element link : links) {
- String absUrl = link.absUrl("href");
- // Skip links that are not children of the current URL
- if (!absUrl.startsWith(url) || url.equals(absUrl)) {
- continue;
- }
- String relativePath = absUrl.substring(url.length());
- if (MAVEN_METADATA_XML_FILENAME.equals(relativePath) || !relativePath.contains(".")) {
- String path = absUrl.substring(ROOT.length());
- writer.write(path + "\n");
- writer.flush();
- System.out.println(path);
- }
- }
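- } catch (IOException e) {
- e.printStackTrace();
- } catch (InterruptedException e) {
- // Restore the interrupt flag so the interruption is not lost
- Thread.currentThread().interrupt();
- }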
- }
- }
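Notes:
- Before running, create a level_0.txt file in the crawl directory and put the starting path maven2/ in it. The code prepends ROOT to every line, so the file must hold the path relative to https://repo.maven.apache.org/ rather than the full URL https://repo.maven.apache.org/maven2/. As the crawl runs, it generates level_1.txt, level_2.txt and so on, one file per directory depth.
- The sample code is single-threaded and sleeps between requests (to avoid getting the IP banned). If you need multiple threads, design that yourself.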
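As a convenience, the seed file can also be created in code. This is a minimal sketch, assuming the crawler's ROOT + line convention, so the seed is the relative path maven2/. The class name CrawlSeedSketch and the default directory name maven-crawl are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;

public class CrawlSeedSketch {

    /** Creates the crawl directory if needed and writes the seed line to level_0.txt. */
    static Path seed(Path dir, String seedLine) {
        try {
            Files.createDirectories(dir);
            Path levelZero = dir.resolve("level_0.txt");
            Files.write(levelZero, Collections.singletonList(seedLine), StandardCharsets.UTF_8);
            return levelZero;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical default directory; pass another one as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "maven-crawl");
        // The crawler builds each URL as ROOT + line, so the seed is a relative path.
        System.out.println("Seeded " + seed(dir, "maven2/"));
    }
}
```

With the seed in place, the crawler takes the same directory as args[0] and the sleep time in milliseconds as args[1], for example maven-crawl 500.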
Parse maven-metadata.xml
Sample code
- package tech.ibit.crawler;
- import org.apache.commons.collections4.CollectionUtils;
- import org.apache.commons.io.IOUtils;
- import org.apache.commons.lang.StringUtils;
- import org.w3c.dom.Document;
- import org.w3c.dom.Node;
- import org.w3c.dom.NodeList;
- import javax.xml.parsers.DocumentBuilder;
- import javax.xml.parsers.DocumentBuilderFactory;
- import java.io.ByteArrayInputStream;
- import java.io.File;
- import java.io.FileWriter;
- import java.io.IOException;
- import java.net.URL;
- import java.nio.charset.StandardCharsets;
- import java.util.LinkedHashSet;
- import java.util.Scanner;
- import java.util.Set;
- /**
- * Maven metadata parser
- *
- * @author IBIT程序猿
- */
- public class MavenMetaDataParser {
- /**
- * Root URL to crawl
- */
- private static final String ROOT = "https://repo.maven.apache.org/";
- /**
- * File name of maven-metadata.xml
- */
- private static final String MAVEN_METADATA_XML_FILENAME = "maven-metadata.xml";
- public static void main(String[] args) {
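- // Arguments
- // args[0]: crawl directory (holds the level_x.txt files)
- // args[1]: sleep time between requests, in milliseconds
- // args[2]: level to start from
- // args[3]: level to end at
- // args[4]: line to start from (optional)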
- if (args.length < 4) {
- System.err.println("Arguments: crawlDirectory sleepMillis startLevel endLevel [startLine]");
- System.exit(1);
- }
- String dirPath = args[0];
- File dir = new File(dirPath);
- if (!dir.exists() || !dir.isDirectory()) {
- System.err.println("Crawl directory does not exist, dir: " + dirPath);
- System.exit(1);
- }
- int sleepMillis = Integer.parseInt(args[1]);
- int beginLevel = Integer.parseInt(args[2]);
- int endLevel = Integer.parseInt(args[3]);
- String beginLine = null;
- if (args.length > 4) {
- beginLine = args[4];
- }
- boolean begin = null == beginLine;
- for (int i = beginLevel; i <= endLevel; i++) {
- File urlFile = getLevelFile(dir, i);
- if (!urlFile.exists()) {
- break;
- }
- try (Scanner scanner = new Scanner(urlFile);
- FileWriter writer = new FileWriter(getVersionLevelFile(dir, i))) {
- while (scanner.hasNextLine()) {
- String line = scanner.nextLine();
- if (StringUtils.isNotBlank(line)) {
- if (!begin && line.equals(beginLine)) {
- begin = true;
- }
- if (begin && line.endsWith(MAVEN_METADATA_XML_FILENAME)) {
- String url = ROOT + line;
- appendVersions(url, sleepMillis, writer);
- }
- }
- }
- } catch (IOException e) {
- e.printStackTrace();
- }
- }
- }
- /**
- * Downloads one maven-metadata.xml and appends its versions to the output
- *
- * @param url url
- * @param sleepMillis sleep time in milliseconds
- * @param writer writer
- */
- private static void appendVersions(String url, int sleepMillis, FileWriter writer) {
- try {
- Thread.sleep(sleepMillis);
- String xmlContent = IOUtils.toString(new URL(url), StandardCharsets.UTF_8);
- DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
- DocumentBuilder builder = factory.newDocumentBuilder();
- try (ByteArrayInputStream in = new ByteArrayInputStream(xmlContent.getBytes(StandardCharsets.UTF_8))) {
- Document doc = builder.parse(in);
- String groupId = getSingleValue(doc, "groupId");
- if (StringUtils.isBlank(groupId)) {
- return;
- }
- String artifactId = getSingleValue(doc, "artifactId");
- if (StringUtils.isBlank(artifactId)) {
- return;
- }
- Set<String> versions = getMultiValues(doc, "version");
- if (CollectionUtils.isEmpty(versions)) {
- return;
- }
- String versionLine = groupId + ":" + artifactId + ":" + StringUtils.join(versions, ",");
- writer.write(versionLine + "\n");
- writer.flush();
- System.out.println(versionLine);
- }
- } catch (Exception e) {
- e.printStackTrace();
- }
- }
- /**
- * Gets the URL list file for a level
- *
- * @param dir directory
- * @param level level
- * @return file
- */
- private static File getLevelFile(File dir, int level) {
- return new File(dir.getAbsolutePath() + File.separator + "level_" + level + ".txt");
- }
- /**
- * Gets the version output file for a level
- *
- * @param dir directory
- * @param level level
- * @return file
- */
- private static File getVersionLevelFile(File dir, int level) {
- return new File(dir.getAbsolutePath() + File.separator + "version_level_" + level + ".txt");
- }
- /**
- * Gets a single value
- *
- * @param document document
- * @param tagName tag name
- * @return the value of the first matching tag, or null
- */
- private static String getSingleValue(Document document, String tagName) {
- NodeList nodeList = document.getElementsByTagName(tagName);
- if (nodeList.getLength() == 0) {
- return null;
- }
- return getNodeValue(nodeList.item(0));
- }
- /**
- * Gets multiple values
- *
- * @param document document
- * @param tagName tag name
- * @return set of values
- */
- private static Set<String> getMultiValues(Document document, String tagName) {
- Set<String> values = new LinkedHashSet<>();
- NodeList nodeList = document.getElementsByTagName(tagName);
- for (int i = 0; i < nodeList.getLength(); i++) {
- String value = getNodeValue(nodeList.item(i));
- if (null != value) {
- values.add(value);
- }
- }
- return values;
- }
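- /**
- * Gets the value of a node
- *
- * @param node node
- * @return node value, or null if the node is missing or empty
- */
- private static String getNodeValue(Node node) {
- if (null == node) {
- return null;
- }
- // An empty element such as <version/> has no child node
- Node child = node.getFirstChild();
- return null == child ? null : child.getNodeValue();
- }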
- }
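Notes
- This sample reads the maven-metadata.xml entries listed in the level_x.txt files generated by the crawler and parses the groupId, artifactId and version out of each one. The results are written to version_level_x.txt, one line per metadata file, in the form groupId:artifactId:version1,version2.
- The sample code is single-threaded and sleeps between requests (to avoid getting the IP banned). If you need multiple threads, design that yourself.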
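The output lines are easy to read back. The following sketch, assuming the groupId:artifactId:v1,v2 line format that appendVersions writes, expands one line into a separate coordinate per version. The class name VersionLineSketch and the method expand are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class VersionLineSketch {

    /** Expands "groupId:artifactId:v1,v2" into one "groupId:artifactId:version" string per version. */
    static List<String> expand(String line) {
        // Split into at most three parts so the version list stays intact.
        String[] parts = line.trim().split(":", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected groupId:artifactId:versions but got: " + line);
        }
        List<String> coordinates = new ArrayList<>();
        for (String version : parts[2].split(",")) {
            if (!version.isEmpty()) {
                coordinates.add(parts[0] + ":" + parts[1] + ":" + version);
            }
        }
        return coordinates;
    }

    public static void main(String[] args) {
        // Line as the parser would write it for the sql-builder sample.
        for (String coordinate : expand("tech.ibit:sql-builder:1.0,1.1,2.0")) {
            System.out.println(coordinate);
        }
    }
}
```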
Other notes: dependencies to add to pom.xml
- <dependencies>
- <dependency>
- <groupId>org.jsoup</groupId>
- <artifactId>jsoup</artifactId>
- <version>1.14.3</version>
- </dependency>
- <dependency>
- <groupId>commons-lang</groupId>
- <artifactId>commons-lang</artifactId>
- <version>2.6</version>
- </dependency>
- <dependency>
- <groupId>org.apache.commons</groupId>
- <artifactId>commons-collections4</artifactId>
- <version>4.4</version>
- </dependency>
- <dependency>
- <groupId>commons-io</groupId>
- <artifactId>commons-io</artifactId>
- <version>2.11.0</version>
- </dependency>
- </dependencies>