apache / Tika: The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
yobix-ai / Extractous: Fast and efficient unstructured data extraction. Written in Rust with bindings for many languages.
chrismattmann / Tika Python: Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
dadoonet / Fscrawler: Elasticsearch File System Crawler (FS Crawler)
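Snailclimb / Interview Guide: Built on Spring Boot 4.0 + Java 21 + Spring AI + PostgreSQL + pgvector + RustFS + Redis, implementing core features such as intelligent résumé analysis, AI mock interviews, and knowledge-base RAG retrieval. Well suited as a learning or résumé project, with a low learning barrier.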
USCDataScience / Sparkler: Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
ICIJ / Extract: A cross-platform command line tool for parallelised content extraction and analysis.
google / Go Tika: Go package for using Apache Tika
apache / Tika Docker: Convenience Docker images for Apache Tika Server
KevM / Tikaondotnet: Use the Java Tika text extraction library on the .NET platform
shebinleo / Pdf2html: pdf2html is a module that converts PDF files to HTML pages using Apache Tika. It can also generate thumbnail images for PDF files using Apache PDFBox.
LogicalSpark / Docker Tikaserver: Apache Tika Server as a Docker Image
ICIJ / Node Tika: Apache Tika bridge for Node.js. Text and metadata extraction, language detection and more.
chrismattmann / MLwithTensorFlow2ed: Code for Machine Learning with TensorFlow: 2nd Edition Published by Manning Publications
nasa-jpl-memex / Memex Explorer: Viewers for statistics and dashboarding of Domain Search Engine data
vaites / Php Apache Tika: Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats
chrismattmann / Tika Similarity: Tika-Similarity uses the Tika-Python package (Python port of Apache Tika) to compute file similarity based on Metadata features.
pelick / VerticleSearchEngine: Academic Search Engine using Scrapy, MongoDB, Lucene/Solr, Tika, Struts2, jQuery, Bootstrap, D3, CAS
chrismattmann / Imagecat: ImageCat is an Apache OODT RADIX application that uses Apache Solr, Apache Tika and Apache OODT to ingest tens of millions of files (images, but could be extended to other files) in place, and to extract metadata and OCR information from those files/images using Tika and Tesseract OCR.
nasa-jpl-memex / Image Space: Interactive Image similarity and Visual Search and Retrieval application