DotnetSpider
DotnetSpider, a .NET standard web crawling library. It is lightweight, efficient and fast high-level web crawling & scraping framework
Install / Use
/learn @dotnetcore/DotnetSpiderREADME
DotnetSpider
免责申明:本框架是为了帮助开发人员简化开发流程、提高开发效率,请勿使用此框架做任何违法国家法律的事情,使用者所做任何事情也与本框架的作者无关。
DotnetSpider, a .NET Standard web crawling library. It is a lightweight, efficient, and fast high-level web crawling & scraping framework.
If you want to get the latest beta packages, you should add the myget feed:
<add key="myget.org" value="https://www.myget.org/F/zlzforever/api/v3/index.json" protocolVersion="3" />
DESIGN

DEVELOP ENVIROMENT
-
Visual Studio 2017 (15.3 or later) or Jetbrains Rider
-
Docker
-
MySql
docker run --name mysql -d -p 3306:3306 --restart always -e MYSQL_ROOT_PASSWORD=1qazZAQ! mysql:5.7 -
Redis (option)
docker run --name redis -d -p 6379:6379 --restart always redis -
SqlServer
docker run --name sqlserver -d -p 1433:1433 --restart always -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=1qazZAQ!' mcr.microsoft.com/mssql/server:2017-latest -
PostgreSQL (option)
docker run --name postgres -d -p 5432:5432 --restart always -e POSTGRES_PASSWORD=1qazZAQ! postgres -
MongoDb (option)
docker run --name mongo -d -p 27017:27017 --restart always mongo -
RabbitMQ
docker run -d --restart always --name rabbimq -p 4369:4369 -p 5671-5672:5671-5672 -p 25672:25672 -p 15671-15672:15671-15672 \ -e RABBITMQ_DEFAULT_USER=user -e RABBITMQ_DEFAULT_PASS=password \ rabbitmq:3-management -
Docker remote api for mac
docker run -d --restart always --name socat -v /var/run/docker.sock:/var/run/docker.sock -p 2376:2375 bobrik/socat TCP4-LISTEN:2375,fork,reuseaddr UNIX-CONNECT:/var/run/docker.sock -
HBase
docker run -d --restart always --name hbase -p 20550:8080 -p 8085:8085 -p 9090:9090 -p 9095:9095 -p 16010:16010 dajobe/hbase
MORE DOCUMENTS
https://github.com/dotnetcore/DotnetSpider/wiki
SAMPLES
Please see the Project DotnetSpider.Sample in the solution.
BASE USAGE
ADDITIONAL USAGE: Configurable Entity Spider
[DisplayName("博客园爬虫")]
public class EntitySpider(
IOptions<SpiderOptions> options,
DependenceServices services,
ILogger<Spider> logger)
: Spider(options, services, logger)
{
public static async Task RunAsync()
{
var builder = Builder.CreateDefaultBuilder<EntitySpider>(options =>
{
options.Speed = 1;
});
builder.UseSerilog();
builder.IgnoreServerCertificateError();
await builder.Build().RunAsync();
}
protected override async Task InitializeAsync(CancellationToken stoppingToken = default)
{
AddDataFlow<DataParser<CnblogsEntry>>();
AddDataFlow(GetDefaultStorage);
await AddRequestsAsync(
new Request(
"https://news.cnblogs.com/n/page/1", new Dictionary<string, object> { { "网站", "博客园" } }));
}
[Schema("cnblogs", "news")]
[EntitySelector(Expression = ".//div[@class='news_block']", Type = SelectorType.XPath)]
[GlobalValueSelector(Expression = ".//a[@class='current']", Name = "类别", Type = SelectorType.XPath)]
[GlobalValueSelector(Expression = "//title", Name = "Title", Type = SelectorType.XPath)]
[FollowRequestSelector(Expressions = ["//div[@class='pager']"])]
public class CnblogsEntry : EntityBase<CnblogsEntry>
{
protected override void Configure()
{
HasIndex(x => x.Title);
HasIndex(x => new { x.WebSite, x.Guid }, true);
}
public int Id { get; set; }
[Required]
[StringLength(200)]
[ValueSelector(Expression = "类别", Type = SelectorType.Environment)]
public string Category { get; set; }
[Required]
[StringLength(200)]
[ValueSelector(Expression = "网站", Type = SelectorType.Environment)]
public string WebSite { get; set; }
[StringLength(200)]
[ValueSelector(Expression = "Title", Type = SelectorType.Environment)]
[ReplaceFormatter(NewValue = "", OldValue = " - 博客园")]
public string Title { get; set; }
[StringLength(40)]
[ValueSelector(Expression = "GUID", Type = SelectorType.Environment)]
public string Guid { get; set; }
[ValueSelector(Expression = ".//h2[@class='news_entry']/a")]
public string News { get; set; }
[ValueSelector(Expression = ".//h2[@class='news_entry']/a/@href")]
public string Url { get; set; }
[ValueSelector(Expression = ".//div[@class='entry_summary']")]
[TrimFormatter]
public string PlainText { get; set; }
[ValueSelector(Expression = "DATETIME", Type = SelectorType.Environment)]
public DateTime CreationTime { get; set; }
}
}
Distributed spider
Puppeteer downloader
Coming soon
NOTICE
when you use redis scheduler, please update your redis config:
timeout 0
tcp-keepalive 60
Dependencies
| Package | License | | --- | --- | | Bert.RateLimiters | Apache 2.0 | | MessagePack | MIT | | Newtonsoft.Json | MIT | | Dapper | Apache 2.0 | | HtmlAgilityPack | MIT | | ZCJ.HashedWheelTimer | MIT | | murmurhash | Apache 2.0 | | Serilog.AspNetCore | Apache 2.0 | | Serilog.Sinks.Console | Apache 2.0 | | Serilog.Sinks.RollingFile | Apache 2.0 | | Serilog.Sinks.PeriodicBatching | Apache 2.0 | | MongoDB.Driver | Apache 2.0 | | MySqlConnector | MIT | | AutoMapper.Extensions.Microsoft.DependencyInjection | MIT | | Docker.DotNet | MIT | | BuildBundlerMinifier | Apache 2.0 | | Pomelo.EntityFrameworkCore.MySql | MIT | | Quartz.AspNetCore | Apache 2.0 | | Quartz.AspNetCore.MySqlConnector | Apache 2.0 | | Npgsql | PostgreSQL License | | RabbitMQ.Client | Apache 2.0 | | Polly | BSD 3-C |
AREAS FOR IMPROVEMENTS
QQ Group: 477731655 Email: zlzforever@163.com
Related Skills
node-connect
341.6kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
84.6kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
341.6kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
commit-push-pr
84.6kCommit, push, and open a PR
