skrape{it}
| :bellhop_bell: :rotating_light: Help wanted :rotating_light: :bellhop_bell: |
|-------------|
| Looking for Co-Maintainer(s), please contact christian.draeger1@gmail.com if you are interested in helping to maintain and evolve skrape{it} :heart: |
skrape{it} is a Kotlin-based HTML/XML testing and web scraping library that can be used seamlessly in Spring Boot, Ktor, Android, or other Kotlin-JVM projects. The ability to analyze and extract HTML, including client-side rendered DOM trees and all other XML-related markup specifications such as SVG, UML, and RSS, makes it unique. It places particular emphasis on ease of use and a high level of readability by providing an intuitive DSL. First and foremost, skrape{it} aims to be a testing tool (not tied to a particular test runner), but it can also be used to scrape websites in a convenient fashion.
Features
Parsing
- [x] Deserialization of HTML/XML from websites, local HTML files, and HTML given as a string into data classes / POJOs.
- [x] Designed to deserialize HTML, but can handle any XML-related markup specification such as SVG, UML, RSS, or XML itself.
- [x] DSL to select HTML elements, with additional support for CSS query-selector syntax via string invocation.
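The two selection styles above (DSL element functions and CSS query-selector strings) can be sketched as follows. This is a minimal, hedged example: the HTML snippet and the selectors are made up for illustration, and the imports assume the skrape{it} 1.x package layout used throughout this README.

```kotlin
import it.skrape.core.htmlDocument
import it.skrape.selects.html5.p

fun main() {
    val href = htmlDocument(
        """
        <div class="box">
            <p>first paragraph</p>
            <a href="http://example.org">example</a>
        </div>
        """
    ) {
        // DSL-style selection of an HTML element
        p { findFirst { text } }
        // CSS query-selector syntax by string invocation
        "div.box > a" { findFirst { attribute("href") } }
    }
    println(href)
}
```

The lambda passed to `htmlDocument` returns its last expression, so `href` holds the result of the string-invocation selector.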
Http-Client
- [x] HTTP client without verbosity or ceremony: make requests and set request options such as headers and cookies in a fluent-style interface.
- [x] Pre-configure the client with authentication and other request settings.
- [x] Can handle client-side rendered web pages. JavaScript execution results can optionally be included in the response body.
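A hedged sketch of the points above: a pre-configured request with headers and cookies, fetched with a JavaScript-capable fetcher. The URL, header, and cookie values are placeholders; `BrowserFetcher` renders JavaScript, while `HttpFetcher` (used in the examples further down) returns the plain response.

```kotlin
import it.skrape.core.htmlDocument
import it.skrape.fetcher.BrowserFetcher
import it.skrape.fetcher.request
import it.skrape.fetcher.response
import it.skrape.fetcher.skrape

fun main() {
    val title = skrape(BrowserFetcher) { // executes JavaScript before parsing
        request {
            url = "http://localhost:8080"           // placeholder target
            headers = mapOf("X-Example" to "value") // illustrative header
            cookies = mapOf("session" to "abc123")  // illustrative cookie
        }
        response {
            htmlDocument { titleText }
        }
    }
    println(title)
}
```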
Idiomatic
- [x] Easy to use, idiomatic and type-safe DSL to ensure a high level of readability.
- [x] Built-in matchers/assertions based on infix functions to achieve a very high level of readability.
- [x] The DSL behaves like a fluent API to make data extraction/scraping as comfortable as possible.
Compatibility
- [x] Not bound to a specific test runner or framework.
- [x] Works with any assertion library of your choice.
- [x] Open for implementing your own fetcher.
- [x] Supports non-blocking fetching via coroutines.
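The non-blocking fetching mentioned above can be sketched with the coroutine-based `AsyncFetcher`. This is an illustrative example, assuming a server at the placeholder URL; `runBlocking` is only used here to give the suspending call a main entry point.

```kotlin
import it.skrape.fetcher.AsyncFetcher
import it.skrape.fetcher.request
import it.skrape.fetcher.response
import it.skrape.fetcher.skrape
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    // AsyncFetcher suspends instead of blocking the calling thread
    val statusCode = skrape(AsyncFetcher) {
        request { url = "http://localhost:8080" } // placeholder target
        response { status { code } }
    }
    println(statusCode)
}
```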
Extensions
In addition, extensions for well-known testing libraries are provided to extend them with the mentioned skrape{it} functionality. Currently available:
Quick Start
Read the Docs
You'll always find the latest documentation, release notes, and examples for official releases at https://docs.skrape.it. The README file you are reading right now provides examples related to the latest master. Use it if you don't want to wait for the latest changes to be released. If you don't want to read that much or just want a rough overview of how to use skrape{it}, have a look at the Documentation by Example section, which refers to the current master.
Installation
All our official/stable releases are published to Maven Central.
Add dependency
<details open><summary>Gradle</summary>

dependencies {
implementation("it.skrape:skrapeit:1.2.2")
}
</details>
<details><summary>Maven</summary>
<dependency>
<groupId>it.skrape</groupId>
<artifactId>skrapeit</artifactId>
<version>1.2.2</version>
</dependency>
</details>
Using bleeding-edge features before official release
We offer snapshot releases by publishing every successful build of a commit pushed to the master branch, so you can use the latest implementation of skrape{it}. Be careful: these are non-official releases that may be unstable, and breaking changes can occur at any time.
Add experimental stuff
<details open><summary>Gradle</summary>

repositories {
maven { url = uri("https://oss.sonatype.org/content/repositories/snapshots/") }
}
dependencies {
implementation("it.skrape:skrapeit:0-SNAPSHOT") { isChanging = true } // version number will stay - implementation may change ...
}
// optional
configurations.all {
resolutionStrategy {
cacheChangingModulesFor(0, "seconds")
}
}
</details>
<details><summary>Maven</summary>
<repositories>
<repository>
<id>snapshot</id>
<url>https://oss.sonatype.org/content/repositories/snapshots/</url>
</repository>
</repositories>
...
<dependency>
<groupId>it.skrape</groupId>
<artifactId>skrapeit</artifactId>
<version>0-SNAPSHOT</version>
</dependency>
</details>
Documentation by Example
(referring to current master)
You can find further examples in the projects integration tests.
Android
We have a working Android sample using Jetpack Compose in our example projects as living documentation.
Parse and verify HTML from String
@Test
fun `can read and return html from String`() {
htmlDocument("""
<html>
<body>
<h1>welcome</h1>
<div>
<p>first p-element</p>
<p class="foo">some p-element</p>
<p class="foo">last p-element</p>
</div>
</body>
</html>""") {
h1 {
findFirst {
text toBe "welcome"
}
}
p {
withClass = "foo"
findFirst {
text toBe "some p-element"
className toBe "foo"
}
}
p {
findAll {
text toContain "p-element"
}
findLast {
text toBe "last p-element"
}
}
}
}
}
Parse HTML and extract
data class MySimpleDataClass(
val httpStatusCode: Int,
val httpStatusMessage: String,
val paragraph: String,
val allParagraphs: List<String>,
val allLinks: List<String>
)
class HtmlExtractionService {
fun extract() {
val extracted = skrape(HttpFetcher) {
request {
url = "http://localhost:8080"
}
response {
MySimpleDataClass(
httpStatusCode = status { code },
httpStatusMessage = status { message },
allParagraphs = document.p { findAll { eachText } },
paragraph = document.p { findFirst { text } },
allLinks = document.a { findAll { eachHref } }
)
}
}
print(extracted)
// will print:
// MySimpleDataClass(httpStatusCode=200, httpStatusMessage=OK, paragraph=i'm a paragraph, allParagraphs=[i'm a paragraph, i'm a second paragraph], allLinks=[http://some.url, http://some-other.url])
}
}
Parse HTML and extract it
data class MyDataClass(
var httpStatusCode: Int = 0,
var httpStatusMessage: String = "",
var paragraph: String = "",
var allParagraphs: List<String> = emptyList(),
var allLinks: List<String> = emptyList()
)
class HtmlExtractionService {
fun extract() {
val extracted = skrape(HttpFetcher) {
request {
url = "http://localhost:8080"
}
extractIt<MyDataClass> {
it.httpStatusCode = statusCode
it.httpStatusMessage = statusMessage.toString()
htmlDocument {
it.allParagraphs = p { findAll { eachText }}
it.paragraph = p { findFirst { text }}
it.allLinks = a { findAll { eachHref }}
}
}
}
print(extracted)
// will print:
// MyDataClass(httpStatusCode=200, httpStatusMessage=OK, paragraph=i'm a paragraph, allParagraphs=[i'm a paragraph, i'm a second paragraph], allLinks=[http://some.url, http://some-other.url])
    }
}