# P-HAR: Porn Human Action Recognition
## Update
:star: In the meantime, I've trained models that surpass 94% accuracy on 20 action categories. They are readily available via an easy-to-use API. Get in touch for more details!
How this AI can benefit you:
- 🏷️ Automated Tagging: Can easily extend to more categories as required.
- ⏱️ Automated Timestamp Generation: Allows users to swiftly navigate to any section of the video, offering a user-friendly experience akin to YouTube's.
- 🔍 Improved Recommendation System: Enhances content suggestions by analyzing the occurrences and timings within the video, providing more relevant and tailored recommendations.
- 🚫 Content Filtering: Facilitates the filtering of specific content, such as non-sexual content, or certain actions and positions, allowing for a more personalized user experience.
- 🎞️ Shorts: Enables the extraction of specific actions from videos to create concise and engaging clips, a feature particularly popular among Gen Z users.
If you're interested in some of the technical details of the first version, read on!
## Introduction

This is a fun side project exploring how state-of-the-art (SOTA) Human Action Recognition (HAR) models fare in the pornographic domain. HAR is a relatively new, active field of deep learning research whose goal is to identify human actions from various input streams (e.g. video or sensor data).

The pornography domain is interesting from a technical perspective because of its inherent difficulties. Lighting variations, occlusions, and a tremendous variety of camera angles and filming techniques (POV, dedicated camera person) make position (action) recognition hard. Two identical positions (actions) can be captured from such different camera perspectives that the model is entirely confused in its predictions.
This repository uses three different input streams to get the best possible results: RGB frames, human skeletons, and audio. Three corresponding models are trained on these streams, and their results are merged through late fusion.

The best accuracy this multi-model ensemble currently reaches is 75.64%, which is promising considering the small training set. This result will be improved in the future.

The models work on spatio-temporal data, meaning they process video clips rather than single images (miles-deep, for example, uses single images). This is an inherently superior way of performing action recognition.
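To illustrate the clip-based (spatio-temporal) processing, here is a minimal sketch of sampling evenly strided frame indices from a video segment to form one clip; `sample_clip` and its parameters are hypothetical illustrations, not part of this repo.

```python
import numpy as np

def sample_clip(num_video_frames, clip_len=32, stride=2):
    """Pick clip_len evenly strided frame indices, centered in the segment."""
    span = clip_len * stride
    start = max(0, (num_video_frames - span) // 2)
    idx = start + np.arange(clip_len) * stride
    # clamp in case the segment is shorter than the requested span
    return np.clip(idx, 0, num_video_frames - 1)

indices = sample_clip(210)  # e.g. a 7 s segment at 30 fps
```

The model then sees the whole stack of frames at once, which is what lets it pick up on motion rather than a single static pose.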
Currently, 17 actions are supported. You can find the complete list here. More data would be needed to further improve the models (help is welcome). Read on for more information!
## Supported Features
First download the human detector here, pose model here, and HAR models here. Then move them inside the checkpoints/har folder.
Or just use a Docker container built from the image.
### Video Demo
Input a video and get a demo with the top predictions every 7 seconds by default.
```
python src/demo/multimodial_demo.py video.mp4 demo.mp4
```
Alternatively, the results can also be dumped into a JSON file by specifying a .json output file.
If you only want to use the RGB & Skeleton model, then you can disable the audio model like so:
```
python src/demo/multimodial_demo.py video.mp4 demo.json --audio-checkpoint '' --coefficients 0.5 1.0 --verbose
```
Check out the detailed usage.
### Timestamp Generator

Use the `--timestamps` flag:

```
python src/demo/multimodial_demo.py video.mp4 demo.json --timestamps
```
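Conceptually, timestamp generation boils down to merging consecutive clip-level predictions that share a label into start/end segments. A minimal sketch, assuming a hypothetical list-of-dicts layout for the predictions (the demo's actual JSON schema may differ):

```python
def merge_timestamps(clip_preds):
    """Merge consecutive same-label clip predictions into segments."""
    segments = []
    for pred in clip_preds:
        if segments and segments[-1]["label"] == pred["label"]:
            segments[-1]["end"] = pred["end"]  # extend the running segment
        else:
            segments.append(dict(pred))       # start a new segment
    return segments

preds = [
    {"start": 0, "end": 7, "label": "missionary"},
    {"start": 7, "end": 14, "label": "missionary"},
    {"start": 14, "end": 21, "label": "doggy"},
]
segments = merge_timestamps(preds)
```

With 7-second clips, the resulting segments are what a YouTube-style chapter bar would be built from.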
### Tag Generator

Given the predictions generated by the multimodal demo (in JSON), we can grab the top 3 tags (by default) like so:

```
python src/top_tags.py demo.json
```
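The idea behind the tag generator can be sketched as a frequency count over the clip-level labels; the real `top_tags.py` may weight or score predictions differently.

```python
from collections import Counter

def top_tags(clip_labels, k=3):
    """Return the k most frequently predicted labels."""
    return [label for label, _ in Counter(clip_labels).most_common(k)]

tags = top_tags(["doggy", "missionary", "doggy", "cowgirl", "doggy", "missionary"])
```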
Check out the detailed usage.
### Content Filtering

TODO: depending on whether people need it.
### Deployment

Depends on whether people find this project useful. Currently one has to install the relevant libraries to use these models; see the installation section below.
## Motivation & Use Cases

The idea behind this project is to apply the latest deep learning techniques (i.e. human action recognition) to the pornographic domain.

Once we have detailed information about the kinds of actions/positions happening in a video, a number of use cases open up:
- Improving the recommender system
- Automatic tag generator
- Automatic timestamp generator (when does an action start and finish)
- Cutting content out (for example non-sexual content)
## Installation

### Docker

Build the Docker image:

```
docker build -f docker/Dockerfile . -t rlleshi/phar
```
### Manual Installation

This project is based on MMAction2.

The following installation instructions are for Ubuntu (hence they should also work on Windows WSL). Check the links for details if you are interested in other operating systems.
- Clone this repo and its submodules:

  ```
  git clone --recurse-submodules git@github.com:rlleshi/phar.git
  ```

  Then create an environment with Python 3.8+.

- Install torch (it is, of course, recommended that you have CUDA & cuDNN installed).

- Install the correct version of mmcv based on your CUDA & torch versions, e.g.:

  ```
  pip install mmcv-full==1.3.18 -f https://download.openmmlab.com/mmcv/dist/cu113/torch1.10.0/index.html
  ```

- Install MMAction2:

  ```
  cd mmaction2/ && pip install cython --no-cache-dir && pip install --no-cache-dir -e .
  ```

- Install MMPose, link.

- Install MMDetection, link.

- Install extra dependencies:

  ```
  pip install -r requirements/extra.txt
  ```
## Models

The SOTA results are achieved by late-fusing three models based on three input streams. This yields significant improvements compared to using only an RGB-based model. Since more than one action might happen at the same time (and, moreover, some of the current actions/positions are conceptually overlapping), it is best to consider top-2 accuracy as the performance measure. By that measure, the multimodal model currently has ~75% accuracy. However, since the dataset is quite small and only ~50 experiments have been performed in total, there is a lot of room for improvement.
### Multi-Modal (RGB + Skeleton + Audio)

The best performing models (performance- & runtime-wise) are TimeSformer for the RGB stream, PoseC3D for the skeleton stream, and ResNet101 for the audio stream. The results of these models are fused through late fusion. The models do not carry the same importance in the late-fusion scoring scheme; currently the fine-tuned weights are 0.5, 0.6, and 1.0 for the RGB, skeleton, and audio models respectively.
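The late-fusion scheme amounts to a weighted sum of each model's class scores, with the highest fused score winning. A minimal sketch using the weights quoted above on made-up softmax outputs for three classes:

```python
import numpy as np

def late_fuse(scores, weights):
    """Weighted sum of per-model class scores; highest fused score wins."""
    fused = sum(w * np.asarray(s) for s, w in zip(scores, weights))
    return fused, int(np.argmax(fused))

rgb   = [0.6, 0.3, 0.1]   # made-up softmax outputs per class
skel  = [0.5, 0.4, 0.1]
audio = [0.1, 0.2, 0.7]
fused, pred = late_fuse([rgb, skel, audio], weights=[0.5, 0.6, 1.0])
```

Note how the heavily weighted audio model can flip the decision even when RGB and skeleton agree, which is why the weights had to be tuned rather than set equal.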
Another approach would be to train a model on two of the input streams at a time (i.e. RGB+skeleton & RGB+audio) and then perhaps combine their results. But this wouldn't work due to the nature of the data. The audio stream can only be exploited for certain actions (e.g. deepthroat due to the gag reflex, or anal due to a higher pitch), while for others no insight can be derived from the audio (e.g. missionary, doggy, and cowgirl have no special characteristics to set them apart from an audio perspective).

Likewise, the skeleton-based model can only be used in those instances where the pose estimation is accurate above a certain confidence threshold (0.4 in these experiments). For actions such as scoop-up or the-snake it is hard to get an accurate pose estimation in most camera angles due to the proximity of the human bodies in the frame (the poses get fuzzy and mixed up), which hurts the HAR model's accuracy. For actions such as doggy, cowgirl, or missionary, however, the pose estimation is generally good enough to train a HAR model.

With a bigger dataset we would probably have enough clean samples of the difficult actions to train a skeleton-based model on all 17 of them. According to the current SOTA literature, skeleton-based models are superior to RGB-based ones. Ideally, of course, the pose estimation models should also be fine-tuned on the sex domain to get better overall pose estimation.
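The confidence-threshold filtering described above can be sketched as keeping only frames whose mean keypoint confidence clears the threshold; the function name and per-frame layout are illustrative, not the repo's exact logic.

```python
import numpy as np

def reliable_pose_frames(keypoint_scores, thr=0.4):
    """Indices of frames whose mean keypoint confidence is >= thr."""
    scores = np.asarray(keypoint_scores)   # shape: (num_frames, num_keypoints)
    return np.where(scores.mean(axis=1) >= thr)[0]

frames = reliable_pose_frames([[0.9, 0.8], [0.3, 0.2], [0.5, 0.45]])
```

Clips with too few reliable frames would then be excluded from skeleton-model training entirely.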
#### Metrics

| Accuracy | Weights |
| --- | --- |
| Top 1: 0.6362 <br> Top 2: 0.7524 <br> Top 3: 0.8155 <br> Top 4: 0.8521 <br> Top 5: 0.8771 | RGB: 0.5 <br> Skeleton: 0.6 <br> Audio: 1.0 |
### RGB Model - TimeSformer

The best results for a 3D RGB model are achieved by the attention-based TimeSformer architecture. This model is also very fast at inference (~0.53 s per 7 s clip).
#### Metrics

| Accuracy | Training Speed | Complexity |
| --- | --- | --- |
| top1_acc: 0.5669 <br> top2_acc: 0.6834 <br> top3_acc: 0.7632 <br> top4_acc: 0.8096 <br> top5_acc: 0.8411 | Avg iter time: 0.3472 s/iter | FLOPs: 100.96 GFLOPs <br> Params: 121.27 M |
#### Loss

#### Classes
All 17 annotations. See annotations.
### Skeleton Model - PoseC3D
The best results for a skeleton-based model are