# OCEC

Open closed eyes classification. Ultra-fast wink and blink estimation model.
In the real world, attempting to detect eyes larger than 20 pixels high and 40 pixels wide is a waste of computational resources.
https://github.com/user-attachments/assets/2ae9467f-a67f-447e-8704-d16efacdacf1
|Variant|Size|F1|CPU<br>inference<br>latency|ONNX|
|:-:|:-:|:-:|:-:|:-:|
|P|112 KB|0.9924|0.16 ms|Download|
|N|176 KB|0.9933|0.25 ms|Download|
|S|494 KB|0.9943|0.41 ms|Download|
|C|875 KB|0.9947|0.49 ms|Download|
|M|1.7 MB|0.9949|0.57 ms|Download|
|L|6.4 MB|0.9954|0.80 ms|Download|
## Setup
```bash
git clone https://github.com/PINTO0309/OCEC.git && cd OCEC
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
source .venv/bin/activate
```
## Inference
```bash
uv run python demo_ocec.py \
-v 0 \
-m deimv2_dinov3_s_wholebody34_1750query_n_batch_640x640.onnx \
-om ocec_l.onnx \
-ep cuda
```

```bash
uv run python demo_ocec.py \
-v 0 \
-m deimv2_dinov3_s_wholebody34_1750query_n_batch_640x640.onnx \
-om ocec_l.onnx \
-ep tensorrt
```
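The `-ep` flag selects which ONNX Runtime execution provider the demo runs on. A minimal sketch of a flag-to-provider mapping with CPU fallback is shown below; the mapping itself is an assumption for illustration (`demo_ocec.py` is the authoritative source), while the provider names are standard ONNX Runtime identifiers.

```python
def providers_for(ep: str) -> list[str]:
    """Map an -ep style flag to an ONNX Runtime provider priority list.

    Assumed mapping for illustration; ONNX Runtime tries providers in
    order, so each accelerated entry falls back to CPU if unavailable.
    """
    table = {
        "cpu": ["CPUExecutionProvider"],
        "cuda": ["CUDAExecutionProvider", "CPUExecutionProvider"],
        "tensorrt": [
            "TensorrtExecutionProvider",
            "CUDAExecutionProvider",
            "CPUExecutionProvider",
        ],
    }
    return table[ep]
```

The resulting list would be passed as the `providers` argument when constructing an `onnxruntime.InferenceSession`.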
## Dataset Preparation
```bash
uv run python 01_dataset_viewer.py --split train
uv run python 01_dataset_viewer.py --split train --visualize
uv run python 01_dataset_viewer.py --split train --extract
```
```bash
uv run python 02_real_data_size_hist.py \
-v real_data/open.mp4 \
-oep tensorrt \
-dvw
# [Eye Analysis] open
# Total frames processed: 930
# Frames with Eye detections: 930
# Frames without Eye detections: 0
# Frames with ≥3 Eye detections: 2
# Total Eye detections: 1818
# Histogram PNG: output_eye_analysis/open_eye_size_hist.png
# Width -> mean=20.94, median=22.00
# Height -> mean=11.39, median=11.00
```
```bash
uv run python 02_real_data_size_hist.py \
-v real_data/closed.mp4 \
-oep tensorrt \
-dvw
# [Eye Analysis] closed
# Total frames processed: 1016
# Frames with Eye detections: 1016
# Frames without Eye detections: 0
# Frames with ≥3 Eye detections: 38
# Total Eye detections: 1872
# Histogram PNG: output_eye_analysis/closed_eye_size_hist.png
# Width -> mean=15.25, median=14.00
# Height -> mean=8.17, median=7.00
```
Considering practical real-world sizes,
I adopt an input resolution of height x width = 24 x 40.
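As a rough illustration of feeding a crop to that 24 x 40 input, the sketch below resizes an eye crop with nearest-neighbor sampling and packs it as an NCHW float tensor. The normalization scheme (divide by 255) and layout are assumptions for illustration, not taken from the repository's preprocessing code.

```python
import numpy as np

def preprocess_eye_crop(crop_bgr: np.ndarray, out_h: int = 24, out_w: int = 40) -> np.ndarray:
    """Resize an eye crop with nearest-neighbor sampling and pack it as a
    (1, 3, out_h, out_w) float32 tensor in [0, 1] (assumed normalization)."""
    h, w = crop_bgr.shape[:2]
    ys = (np.arange(out_h) * h / out_h).astype(np.int64)  # source row per output row
    xs = (np.arange(out_w) * w / out_w).astype(np.int64)  # source col per output col
    resized = crop_bgr[ys][:, xs]                         # (out_h, out_w, 3)
    chw = resized.astype(np.float32).transpose(2, 0, 1) / 255.0
    return chw[np.newaxis]                                # add batch dimension

def postprocess(logit: float) -> float:
    """Sigmoid over the single output logit; which class the probability
    refers to (open vs. closed) is the model's label convention."""
    return 1.0 / (1.0 + np.exp(-logit))
```

The resulting tensor would then be fed to an ONNX Runtime session holding one of the `ocec_*.onnx` variants.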
```bash
uv run python 03_wholebody34_data_extractor.py \
-ea \
-m deimv2_dinov3_x_wholebody34_680query_n_batch_640x640.onnx \
-oep tensorrt
# Eye-only detection summary
# Total images: 131174
# Images with detection: 130596
# Images without detection: 578
# Images with >=3 detections: 1278
# Crops per label:
# closed: 134522
# open: 110796
# Eye-only detection summary
# Total images: 144146
# Images with detection: 143364
# Images without detection: 782
# Images with >=3 detections: 1221
# Crops per label:
# closed: 136347
# open: 135319
```
### Data sample
|Label|ex1|ex2|ex3|ex4|ex5|ex6|ex7|ex8|ex9|ex10|
|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|
|open|<img width="23" height="9" alt="open1_00000001_1" src="https://github.com/user-attachments/assets/e9e3d5f8-3725-4a02-bad2-6d782bc53df3" />|<img width="27" height="9" alt="open1_00000001_2" src="https://github.com/user-attachments/assets/1319e718-c7d8-4dc6-ac8a-d0e935e09cff" />|<img width="23" height="11" alt="open2_00002606_1" src="https://github.com/user-attachments/assets/591b76f9-335a-4ce5-bb02-0d2e18f890a5" />|<img width="19" height="11" alt="open2_00002606_2" src="https://github.com/user-attachments/assets/dc50b131-e9e1-4008-9cf6-64fb72c82381" />|<img width="61" height="37" alt="open3_00000289_1" src="https://github.com/user-attachments/assets/7d5267c3-cd00-403d-b651-f876991f330a" />|<img width="51" height="35" alt="open3_00000289_2" src="https://github.com/user-attachments/assets/35b36785-89b7-4d5d-b74e-c1f71ff3ac32" />|<img width="12" height="10" alt="open4_00000869_1" src="https://github.com/user-attachments/assets/83af4630-a97c-4d22-bd62-772399485fdc" />|<img width="18" height="12" alt="open4_00000869_2" src="https://github.com/user-attachments/assets/5ac9c003-6690-4f2e-a8a0-0992414438c9" />|<img width="26" height="14" alt="open5_00001725_1" src="https://github.com/user-attachments/assets/93dc4094-26ba-42b3-bd66-13d6f67e6f79" />|<img width="35" height="14" alt="open5_00001725_2" src="https://github.com/user-attachments/assets/6ec669b8-ae53-4748-bdb5-c884cb5f4e2e" />|
|closed|<img width="22" height="9" alt="closed_00000414_1" src="https://github.com/user-attachments/assets/ae69e8e6-3d9e-4479-893d-7c0aa236334c" />|<img width="23" height="9" alt="closed_00000414_2" src="https://github.com/user-attachments/assets/6ef22ac9-e7c6-464b-a27d-6b41a8e43322" />|<img width="20" height="13" alt="closed_00000962_1" src="https://github.com/user-attachments/assets/74fc2c2e-66ae-4076-ad24-a405b45dde1f" />|<img width="24" height="13" alt="closed_00000962_2" src="https://github.com/user-attachments/assets/8c2a687e-d8b8-4933-ba32-b311f67aa79e" />|<img width="17" height="8" alt="closed_00001317_1" src="https://github.com/user-attachments/assets/8a588ac3-5415-4405-91b8-98a5540f4d21" />|<img width="22" height="9" alt="closed_00001317_2" src="https://github.com/user-attachments/assets/85bd54f1-f790-4501-ac4b-3e07395539fc" />|<img width="23" height="11" alt="closed_00001502_1" src="https://github.com/user-attachments/assets/4f4157ef-2719-4b6f-a35c-da96a2efe018" />|<img width="19" height="7" alt="closed_00001502_2" src="https://github.com/user-attachments/assets/edc2cf1c-c2e2-4809-b9ee-e5675d69cdd4" />|<img width="20" height="8" alt="closed_00001752_1" src="https://github.com/user-attachments/assets/b8836987-eaab-4be1-8704-f428b48a8727" />|<img width="23" height="6" alt="closed_00001752_2" src="https://github.com/user-attachments/assets/a7890864-5870-4c34-9e97-8065f44375c5" />|
<img width="800" alt="open_eye_size_hist" src="https://github.com/user-attachments/assets/806dc7ec-4132-45a5-ae38-498eb848cf86" /> <img width="800" alt="closed_eye_size_hist" src="https://github.com/user-attachments/assets/4806ae1c-0654-4b79-919f-a113235d6441" />

```bash
uv run python 04_dataset_convert_to_parquet.py \
--annotation data/cropped/annotation.csv \
--output data/dataset.parquet \
--train-ratio 0.8 \
--seed 42 \
--embed-images
# Split summary: {'train_total': 196253, 'train_closed': 107617, 'train_open': 88636, 'val_total': 49065, 'val_closed': 26905, 'val_open': 22160}
# Saved dataset to data/dataset.parquet (245318 rows).
# Split summary: {'train_total': 217332, 'train_closed': 109077, 'train_open': 108255, 'val_total': 54334, 'val_closed': 27270, 'val_open': 27064}
# Saved dataset to data/dataset.parquet (271666 rows).
```
Generated parquet schema (`split`, `label`, `class_id`, `image_path`, `source`):

- `split`: `train` or `val`, assigned with an 80/20 stratified split per label.
- `label`: string eye state (`open`, `closed`); inferred from filename or class id.
- `class_id`: integer class id (`0` = closed, `1` = open) maintained from the annotation.
- `image_path`: path to the cropped PNG stored under `data/cropped/...`.
- `source`: `train_dataset` for `000000001`-prefixed folders, `real_data` for `100000001`+, `unknown` otherwise.
- `image_bytes` (optional): raw PNG bytes for each crop when `--embed-images` is supplied.
Rows are stratified within each label before concatenation, so both splits keep similar open/closed proportions. Class counts per split are printed when the conversion script runs.
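The per-label stratification described above can be sketched as follows; this is an illustrative re-implementation of the idea, not the code from `04_dataset_convert_to_parquet.py`.

```python
import random
from collections import defaultdict

def stratified_split(rows, label_key="label", train_ratio=0.8, seed=42):
    """Shuffle and split rows per label, then concatenate, so train and
    val keep similar open/closed proportions."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_label):          # deterministic label order
        group = by_label[label]
        rng.shuffle(group)                  # shuffle within the label only
        cut = int(len(group) * train_ratio)
        train.extend(group[:cut])
        val.extend(group[cut:])
    return train, val
```

Because the cut happens inside each label group, a class that makes up 60% of the data also makes up roughly 60% of both splits.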
## Training Pipeline
- Use the images located under `dataset/output/002_xxxx_front_yyyyyy` together with their annotations in `dataset/output/002_xxxx_front.csv`.
- Every augmented image that originates from the same `still_image` stays in the same split to prevent leakage.
- The training loop relies on `BCEWithLogitsLoss` with `pos_weight` and a `WeightedRandomSampler` to stabilise optimisation under class imbalance; inference produces sigmoid probabilities.
- Training history, validation metrics, optional test predictions, checkpoints, configuration JSON, and ONNX exports are produced automatically.
- Per-epoch checkpoints named like `ocec_epoch_0001.pt` are retained (latest 10), as well as the best checkpoints named like `ocec_best_epoch0004_f1_0.9321.pt` (also latest 10).
- The backbone can be switched with `--arch_variant`. Supported combinations with `--head_variant` are:

  |`--arch_variant`|Default (`--head_variant auto`)|Explicitly selectable heads|Remarks|
  |:-:|:-:|:-:|:-:|
  |`baseline`|`avg`|`avg`, `avgmax_mlp`|When using `transformer`/`mlp_mixer`, adjust the height and width of the feature map so that they are divisible by `--token_mixer_grid` (otherwise an exception occurs during ONNX conversion or inference).|
  |`inverted_se`|`avgmax_mlp`|`avg`, `avgmax_mlp`|When using `transformer`/`mlp_mixer`, `--token_mixer_grid` must be adjusted as above.|
  |`convnext`|`transformer`|`avg`, `avgmax_mlp`, `transformer`, `mlp_mixer`|For both heads, the grid must divide the feature map evenly (the default `3x2` fits the 30x48 input).|

- The classification head is selected with `--head_variant` (`avg`, `avgmax_mlp`, `transformer`, `mlp_mixer`, or `auto`, which derives a sensible default).
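The imbalance-handling values mentioned above (`pos_weight` for `BCEWithLogitsLoss`, per-class weights for a `WeightedRandomSampler`) can be derived from the class counts. The sketch below shows one common way to compute them; the exact formulas and the choice of `open` as the positive class are assumptions, not taken from the training script.

```python
def imbalance_weights(n_open: int, n_closed: int):
    """Derive loss and sampler weights from class counts.

    Assumes `open` is the positive class. pos_weight scales the loss on
    positive samples by n_negative / n_positive, so rare positives are
    not drowned out; sampler weights are inverse class frequencies, so
    each batch is balanced in expectation.
    """
    pos_weight = n_closed / n_open
    total = n_open + n_closed
    sample_weight = {
        "open": total / (2 * n_open),
        "closed": total / (2 * n_closed),
    }
    return pos_weight, sample_weight
```

In a PyTorch loop these would feed `torch.nn.BCEWithLogitsLoss(pos_weight=...)` and a `torch.utils.data.WeightedRandomSampler` built from the per-sample weights.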
