Somef
SOftware Metadata Extraction Framework: A tool for automatically extracting relevant software information from code repositories (using README files, package metadata, etc.)
Software Metadata Extraction Framework (SOMEF)
<img src="docs/logo.png" alt="logo" width="150"/>

A command line interface for automatically extracting relevant metadata from code repositories (readme, configuration files, documentation, etc.).
Demo: See a demo running somef as a service, through the SOMEF-Vider tool.
Authors: Daniel Garijo, Allen Mao, Miguel Ángel García Delgado, Haripriya Dharmala, Vedant Diwanji, Jiaying Wang, Aidan Kelley, Jenifer Tabita Ciuciu-Kiss, Luca Angheluta and Juanje Mendoza.
Features
Given a readme file (or a GitHub/GitLab repository), SOMEF will extract the following categories (if present), listed in alphabetical order:
- Acknowledgement: Text acknowledging funding sources or contributors
- Application domain: The application domain of the repository. Currently supported domains include: Astrophysics, Audio, Computer vision, Graphs, Natural language processing, Reinforcement learning, Semantic web, Sequential. Domains are not mutually exclusive. These domains have been extracted from awesome lists and Papers with code. Find more information in our documentation.
- Authors: Person(s) or organization(s) responsible for the project. We recognize the following properties:
  - Name: Name of the author (including last name)
  - Given name: First name of an author
  - Family name: Last name of an author
  - Email: Email of the author
  - URL: Website or ORCID associated with the author
- Build file: Build file(s) of the project. For example, files used to create a Docker image for the target software, package files, etc.
- Citation: Preferred citation as the authors have stated in their readme file. SOMEF recognizes Bibtex, Citation File Format files and other means by which authors cite their papers (e.g., by in-text citation). We aim to recognize the following properties:
  - Title: Title of the publication
  - Author: List of author names in the publication
  - URL: URL of the publication
  - DOI: Digital object identifier of the publication
  - Date published
- Code of conduct: Link to the code of conduct of the project
- Code repository: Link to the GitHub/GitLab repository used for the extraction
- Contact: Contact person responsible for maintaining a software component
- Continuous integration: Link to continuous integration service(s)
- Contribution guidelines: Text indicating how to contribute to this code repository
- Contributors: Contributors to a software component. Note: Contributor metadata is extracted from metadata files (e.g., CodeMeta, CONTRIBUTORS), not from git logs.
- Creation date: Date when the repository was created
- Copyright holder: Entity or individual owning the rights to the software. The year is also extracted, if available.
- Date updated: Date of last release.
- Description: A description of what the software does
- Documentation: Where to find additional documentation about a software component
- Download URL: URL where to download the target software (typically the installer, package or a tarball to a stable version)
- Executable examples: Jupyter notebooks ready for execution (e.g., files, or through myBinder/colab links)
- FAQ: Frequently asked questions about a software component
- Forks count: Number of forks of the project
- Forks url: Links to forks made of the project
- Full name: Name + owner (owner/name)
- Full title: If the repository name is a short name, we will attempt to extract the longer version of the repository name
- Identifier: Identifier associated with the software (if any), such as Digital Object Identifiers and Software Heritage identifiers (SWH). DOIs associated with publications will also be detected.
- Images: Images used to illustrate the software component
- Installation instructions: A set of instructions that indicate how to install a target repository
- Invocation: Execution command(s) needed to run a scientific software component
- Issue tracker: Link where to open issues for the target repository
- Keywords: set of terms used to commonly identify a software component
- License: License and usage terms of a software component
- Logo: Main logo used to represent the target software component
- Maintainer: Individuals or teams responsible for maintaining the software component, extracted from the CODEOWNERS file
- Name: Name identifying a software component
- Ontologies: URL and path to the ontology files present in the repository
- Owner: Name and type of the user or organization in charge of the repository
- Package distribution: Links to package sites like pypi in case the repository has a package available.
- Package files: Links to package files used to wrap the project in a package.
- Programming languages: Languages used in the repository
- Related papers: URLs of possibly related papers (from arXiv) stated in the readme file of the repository
- Releases (GitHub only): Pointer to the available versions of a software component. For each release, somef will track the following properties:
  - Description: Release notes
  - Author: Agent responsible for creating the release
  - Name: Name of the release
  - Tag: Version number of the release
  - Date of publication
  - Date of creation
  - Link to the HTML page of the release
  - Id of the release
  - Links to the tarball, zip and code of the release
- Repository status: Repository status as it is described in repostatus.org.
- Requirements: Pre-requisites and dependencies needed to execute a software component
- Run: Running instructions of a software component. It may be wider than the invocation category, as it may include several steps and explanations.
- Runtime platform: Specifies runtime platform or script interpreter dependencies required to run the project.
- Script files: Bash script files contained in the repository
- Stargazers count: Total number of stargazers of the project
- Support: Guidelines and links of where to obtain support for a software component
- Support channels: Help channels one can use to get support about the target software component
- Type: type of software (command line application, notebook, ontology, scientific workflow, etc.)
- Usage examples: Assumptions and considerations recorded by the authors when executing a software component, or examples on how to use it
- Workflows: URL and path to the computational workflow files present in the repository
We retrieve these fields using supervised classifiers, header analysis, regular expressions, the GitHub/GitLab API and language-specific metadata parsers (e.g., for package files). More than one technique may be used for each field. Each extraction records its provenance, with the confidence and technique used on each step. For more information, check the output format description.
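Because each extraction carries a confidence and a technique, downstream code can filter results before using them. The following is a minimal sketch under stated assumptions: the record layout and the field names `result`, `confidence` and `technique`, as well as the sample values, are hypothetical and chosen for illustration only. Check the output format description for the actual schema.

```python
# Hypothetical SOMEF-style output: each category maps to a list of
# extraction records. Field names and values are assumptions, not the
# documented schema.
sample_output = {
    "description": [
        {"result": {"value": "A metadata extraction tool"},
         "confidence": 0.93, "technique": "supervised_classification"},
        {"result": {"value": "Extracts software metadata"},
         "confidence": 1.0, "technique": "GitHub_API"},
    ],
    "installation": [
        {"result": {"value": "pip install somef"},
         "confidence": 0.6, "technique": "header_analysis"},
    ],
}


def filter_by_confidence(output, threshold):
    """Keep only records whose confidence is at least `threshold`."""
    kept = {}
    for category, records in output.items():
        confident = [r for r in records if r.get("confidence", 0.0) >= threshold]
        if confident:  # drop categories with no surviving record
            kept[category] = confident
    return kept


def techniques_used(output):
    """Return the sorted technique names seen for each category."""
    return {
        category: sorted({r["technique"] for r in records})
        for category, records in output.items()
    }
```

With a threshold of 0.8, only the `description` records survive in this sample, since the single `installation` record has a confidence of 0.6.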
Documentation
See full documentation at https://somef.readthedocs.io/en/latest/
Cite SOMEF:
Journal publication (preferred):
@article{10.1162/qss_a_00167,
author = {Kelley, Aidan and Garijo, Daniel},
title = "{A Framework for Creating Knowledge Graphs of Scientific Software Metadata}",
journal = {Quantitative Science Studies},
pages = {1-37},
year = {2021},
month = {11},
issn = {2641-3337},
doi = {10.1162/qss_a_00167},
url = {https://doi.org/10.1162/qss_a_00167},
eprint = {https://direct.mit.edu/qss/article-pdf/doi/10.1162/qss\_a\_00167/1971225/qss\_a\_00167.pdf},
}
Conference publication (first):
@INPROCEEDINGS{9006447,
author={A. {Mao} and D. {Garijo} and S. {Fakhraei}},
booktitle={2019 IEEE International Conference on Big Data (Big Data)},
title={SoMEF: A Framework for Capturing Scientific Software Metadata from its Documentation},
year={2019},
doi={10.1109/BigData47090.2019.9006447},
url={http://dgarijo.com/papers/SoMEF.pdf},
pages={3032-3037}
}
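BibTeX entries like the ones above carry the citation properties listed under Features (title, author, URL, DOI, date). As a rough illustration of how a regular expression can pull such fields out of an entry, here is a small sketch. It is not SOMEF's implementation: the pattern, the function name `parse_bibtex_fields` and the single-line-field assumption are all hypothetical simplifications.

```python
import re

# Shortened copy of the journal entry above, used as sample input.
BIBTEX_ENTRY = """
@article{10.1162/qss_a_00167,
    author = {Kelley, Aidan and Garijo, Daniel},
    title = "{A Framework for Creating Knowledge Graphs of Scientific Software Metadata}",
    year = {2021},
    doi = {10.1162/qss_a_00167},
    url = {https://doi.org/10.1162/qss_a_00167},
}
"""

# Matches `key = {value}` or `key = "value"` when the field fits on one line.
FIELD_PATTERN = re.compile(
    r'^\s*(\w+)\s*=\s*(?:\{(.*)\}|"(.*)")\s*,?\s*$', re.MULTILINE
)


def parse_bibtex_fields(entry):
    """Return a dict of lower-cased field names to their raw string values."""
    fields = {}
    for match in FIELD_PATTERN.finditer(entry):
        key = match.group(1).lower()
        raw = match.group(2) if match.group(2) is not None else match.group(3)
        raw = raw.strip()
        # Remove one layer of protective braces, as in "{Some Title}".
        if raw.startswith("{") and raw.endswith("}"):
            raw = raw[1:-1]
        fields[key] = raw
    return fields


fields = parse_bibtex_fields(BIBTEX_ENTRY)
authors = fields["author"].split(" and ")
```

A regex of this kind breaks on fields that span several lines, which is one reason a dedicated BibTeX parser is usually the safer choice in practice.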
Requirements
- Python 3.11+ (default supported version). Python 3.9 and 3.10 will work, but are no longer supported.
SOMEF has been tested on Unix, macOS and Microsoft Windows operating systems.
If you face any issues when installing SOMEF, please make sure you have installed the following packages: build-essential, libssl-dev, libffi-dev and python3-dev.
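To check which of the cases above applies to a given interpreter, a short script along these lines may help. It is only a sketch: the function name `python_support_level` and the returned labels are made up for illustration, and versions outside the ranges named above are reported as "not listed" because this section does not say how they behave.

```python
import sys


def python_support_level(version_info=None):
    """Classify a Python version against the requirements stated above."""
    major, minor = (version_info or sys.version_info)[:2]
    if (major, minor) >= (3, 11):
        return "supported"
    if (major, minor) in ((3, 9), (3, 10)):
        return "works, unsupported"
    # Anything else is not mentioned in the requirements.
    return "not listed"


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}: "
          f"{python_support_level()}")
```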
Install from PyPI
SOMEF is available on PyPI. To install it, run:
pip install somef
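After installing, you can confirm that the package is visible to the current Python environment. The sketch below uses the standard library's `importlib.metadata`. The helper name `installed_version` is hypothetical, and the distribution name `somef` is taken from the `pip install` command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("somef")
    print(f"somef {found}" if found else "somef is not installed")
```

If this reports that the package is not installed even though `pip install` succeeded, `pip` and `python` may be pointing at different environments.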
Install from GitHub
To run SOMEF from the GitHub sources, follow these steps:
Clone this GitHub repository
git clone https://github.com/KnowledgeCaptureAndDiscovery/somef.git
We use Poetry to ensure library compatibility. It can be installed as follows:
curl -sSL https://install.python-poetry.org | python3 -
This option
