OurMarks
A module for extracting exam marks from official PDFs, for the Faculty of Information Technology Engineering at Damascus University

Introduction
Student exam marks at the Faculty of Information Technology Engineering at Damascus University are published as PDF documents of Excel tables.
The PDF documents don't allow the exam marks to be used in Excel sheets and other programs, because they're only made to be displayed.
That's why the OurMarks module was created. It extracts the mark records from the PDF documents into structured data items that can be exported as CSV tables and used for any computational purpose.
This opens the opportunity for:
- Building a structured database of the students' marks.
- Creating applications for displaying the marks.
- Doing statistical data analysis on the marks.
- Building profiles for students.
- And much more...
Features
- Top-Level API made simple for direct usage
- Written in TypeScript, so type definitions and auto-complete are available in VS Code and other IDEs
- Well documented and available on [npm][ourmarks npm]
- Supports [Node.js] and the browser
- Introduces no side-effects
Example
Node.js (TypeScript)
```typescript
import * as fs from 'fs';
import * as path from 'path';
import { getDocument } from 'pdfjs-dist/legacy/build/pdf';
import { extractMarksFromDocument } from 'ourmarks';

// Read the document's data
const TARGET_DOCUMENT = path.resolve(__dirname, './documents/1617010032_programming 3 -2-f1-2021.pdf');
const documentData = fs.readFileSync(TARGET_DOCUMENT);

// Parse the marks
async function main() {
  const document = await getDocument(documentData).promise;
  const marksRecords = await extractMarksFromDocument(document);
  document.destroy();

  console.log(marksRecords);
}

// Run the asynchronous function
main().catch(console.error);
```
Getting Started
Installation
```sh
npm install ourmarks pdfjs-dist
```

or

```sh
yarn add ourmarks pdfjs-dist
```
Basic Usage
The module provides two top-level asynchronous functions for extracting marks from PDF documents.
The document must first be loaded using PDF.js, which is straightforward:
```typescript
import { getDocument } from 'pdfjs-dist';

// Inside your main asynchronous function
async function main() {
  const document = await getDocument(rawPDFBinaryData).promise;
  // ...

  // Don't forget to destroy the document in order to free the allocated resources.
  document.destroy();
}

// Run the asynchronous function
main().catch(console.error);
```
On Node.js you have to import `pdfjs-dist/legacy/build/pdf` instead, for compatibility reasons.

`rawPDFBinaryData` can be a Node.js `Buffer` object, a URL to the document, a `Uint8Array`, or one of the other options provided by [PDF.js].
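The cleanup reminder above is easy to miss when extraction throws. One possible way to guard against that is a small `try`/`finally` wrapper. This is only a sketch: `withDocument` and `Destroyable` are hypothetical names that are not part of OurMarks or PDF.js, and the structural type assumes nothing beyond the `destroy()` method used above.

```typescript
// Hypothetical helper, not part of OurMarks or PDF.js.
// Structural type: anything that exposes destroy(), such as a loaded PDF.js document.
interface Destroyable {
  destroy(): unknown;
}

// Loads a document, runs `task` with it, and always destroys the document
// afterwards, even when `task` throws.
async function withDocument<D extends Destroyable, R>(
  loadDocument: () => Promise<D>,
  task: (document: D) => Promise<R>,
): Promise<R> {
  const document = await loadDocument();
  try {
    return await task(document);
  } finally {
    await document.destroy();
  }
}

// Possible usage, assuming the imports shown in the examples of this README:
//   const marksRecords = await withDocument(
//     () => getDocument(rawPDFBinaryData).promise,
//     (document) => extractMarksFromDocument(document),
//   );
```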
Then the whole document can be processed at once using extractMarksFromDocument:
```typescript
import { extractMarksFromDocument } from 'ourmarks';

// Inside the main() function defined earlier:
const marksRecords = await extractMarksFromDocument(document);
```
Or it can be processed page by page using extractMarksFromPage:
```typescript
import { extractMarksFromPage, MarkRecord } from 'ourmarks';

const wholeRecords: MarkRecord[] = [];

// Inside the main() function defined earlier:
for (let i = 1; i <= document.numPages; i++) {
  const page = await document.getPage(i);
  const pageRecords = await extractMarksFromPage(page);
  wholeRecords.push(...pageRecords);
}
```
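The introduction mentions exporting the extracted records as CSV tables. The following is a minimal, generic sketch of that step, not an OurMarks API: `recordsToCsv` and `escapeCsvField` are hypothetical helpers, and the column names you pass must match whatever fields `MarkRecord` actually has (check the API documentation, since they are not listed in this section).

```typescript
// Hypothetical helpers, not part of OurMarks.

// Converts one value into a CSV field, quoting it when it contains
// a comma, a double quote, or a line break (RFC 4180 style).
function escapeCsvField(value: unknown): string {
  const text = value === null || value === undefined ? '' : String(value);
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Builds CSV text with a header row followed by one row per record.
// `columns` selects which fields are written and in which order.
function recordsToCsv<T extends object>(
  records: readonly T[],
  columns: readonly (keyof T & string)[],
): string {
  const header = columns.map((column) => escapeCsvField(column)).join(',');
  const rows = records.map((record) =>
    columns.map((column) => escapeCsvField(record[column])).join(','),
  );
  return [header, ...rows].join('\r\n');
}
```

Taking the column list as a parameter keeps the sketch independent of the record shape, so it does not need to guess any field names.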
API Documentation
In addition to the top-level `extractMarksFromDocument` and `extractMarksFromPage` functions, there are several lower-level functions for advanced users.
You don't need them for normal use, but if you want to explore how the module works internally, check the [API documentation][apidocs] and read the 'How it works' section below.
How it works
The marks extractor works through a list of 7 steps:
Step 01: Load the document for parsing
The PDF document is loaded using the PDF.js library so it can be parsed.
Once the document has been loaded, it's possible to load each of its pages.
Step 02: Load each page in the document
Each page in the document is loaded.
Once a page is loaded, it's possible to read its content for processing.
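As a rough sketch of this step, the pages can be loaded one after another. The `PageSource` type below is a hypothetical structural stand-in for the loaded PDF.js document; it only assumes the `numPages` property and `getPage` method already used in the Basic Usage example, and `loadAllPages` is not an OurMarks function.

```typescript
// Hypothetical stand-in for the parts of a loaded PDF.js document used here.
interface PageSource<P> {
  numPages: number;
  getPage(pageNumber: number): Promise<P>;
}

// Loads every page in document order.
// Page numbers start at 1, as in the Basic Usage loop.
async function loadAllPages<P>(document: PageSource<P>): Promise<P[]> {
  const pages: P[] = [];
  for (let pageNumber = 1; pageNumber <= document.numPages; pageNumber++) {
    pages.push(await document.getPage(pageNumber));
  }
  return pages;
}
```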
Step 03: Get the text items of each page
For each page, a list of all the text items in it is created.
Each text item has the following data structure:
| Field Name     | Type                      | Description                                                                  |
|----------------|---------------------------|------------------------------------------------------------------------------|
| string         | string                    | The content of the item                                                      |
| direction      | 'ttb' \| 'ltr' \| 'rtl'   | The direction of the item's content                                          |
| width          | number                    | The width of the item, in document units                                     |
| height         | number                    | The height of the item, in document units                                    |
| transform      | number[]                  | The 3x3 transformation matrix of the item, with only 6 values stored         |
| transform[0]   | number                    | The (0,0) value in the item's transformation matrix, represents scale x      |
| transform[1]   | number                    | The (1,0) value in the item's transformation matrix, represents skew         |
| transform[2]   | number                    | The (0,1) value in the item's transformation matrix, represents skew         |
| transform[3]   | number                    | The (1,1) value in the item's transformation matrix, represents scale y      |
| transform[4]   | number                    | The (0,2) value in the item's transformation matrix, represents translate x  |
| transform[5]   | number                    | The (1,2) value in the item's transformation matrix, represents translate y  |
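To make the matrix layout in the table concrete, here is a small sketch that expands the 6 stored values into a full 3x3 matrix and applies it to a point. `RawTextItem`, `toMatrix3x3`, and `transformPoint` are hypothetical names used for illustration only; the constant last row `[0, 0, 1]` is the usual convention for 2D affine matrices and is an assumption here, since the table lists only the 6 stored values.

```typescript
// Hypothetical type mirroring the table above; the real definition may differ.
interface RawTextItem {
  string: string;
  direction: 'ttb' | 'ltr' | 'rtl';
  width: number;
  height: number;
  // [scaleX, skew, skew, scaleY, translateX, translateY]
  transform: number[];
}

// Expands the 6 stored values into a 3x3 matrix indexed as matrix[row][column],
// following the (row, column) positions listed in the table.
function toMatrix3x3(transform: number[]): number[][] {
  const [a, b, c, d, e, f] = transform;
  return [
    [a, c, e],
    [b, d, f],
    [0, 0, 1], // assumed constant row of an affine matrix
  ];
}

// Applies the transform to a point (x, y).
function transformPoint(transform: number[], x: number, y: number): [number, number] {
  const [a, b, c, d, e, f] = transform;
  return [a * x + c * y + e, b * x + d * y + f];
}
```

With no skew, the item's local origin `(0, 0)` maps to `(transform[4], transform[5])`, which is why those two values can serve as the item's position.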
Step 04: Filter and simplify the text items
With the text items stored in a list, the loaded PDF document can be discarded safely as it's no longer needed.
The following items are filtered out of the list:
- Items with the `ttb` direction, we're only interested in English and Arabic items.
- Items with a non-zero `transform[1]` or `transform[2]`, we're not interested in any items with any rotation/skewing.
- Items with empty `''` content.
- Items with a zero `transform[4]` or `transform[5]`, as they are invisible/invalid.

Then each item is mapped into a simpler data structure, where an item is marked as Arabic if its direction is `rtl`:
| Field Name | Type    | Description                                             |
|------------|---------|---------------------------------------------------------|
| value      | string  | The content of the simplified item                      |
| arabic     | boolean | Whether the item contains any Arabic characters or not  |
| x          | number  | The X coordinate of the item, equal to transform[4]     |
| y          | number  | The Y coordinate of the item, equal to transform[5]     |
| width      | number  | The width of the item                                   |
| height     | number  | The height of the item                                  |
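The filtering and mapping described in this step could look roughly like the sketch below. It is not the module's actual implementation: `RawTextItem`, `SimpleTextItem`, and `simplifyTextItems` are hypothetical names based on the two tables, and the skew rule is read as "drop the item if either skew value is non-zero", which is an interpretation of the text above.

```typescript
// Hypothetical types based on the tables in Step 03 and Step 04.
interface RawTextItem {
  string: string;
  direction: 'ttb' | 'ltr' | 'rtl';
  width: number;
  height: number;
  transform: number[];
}

interface SimpleTextItem {
  value: string;
  arabic: boolean;
  x: number;
  y: number;
  width: number;
  height: number;
}

function simplifyTextItems(items: RawTextItem[]): SimpleTextItem[] {
  return items
    // Drop top-to-bottom items.
    .filter((item) => item.direction !== 'ttb')
    // Drop rotated or skewed items (assumption: either skew value disqualifies).
    .filter((item) => item.transform[1] === 0 && item.transform[2] === 0)
    // Drop items with empty content.
    .filter((item) => item.string !== '')
    // Drop items whose translate x or translate y is zero.
    .filter((item) => item.transform[4] !== 0 && item.transform[5] !== 0)
    // Map the survivors into the simplified structure.
    .map((item) => ({
      value: item.string,
      arabic: item.direction === 'rtl',
      x: item.transform[4],
      y: item.transform[5],
      width: item.width,
      height: item.height,
    }));
}
```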
Step 05: Merge close text items

Update (2022-09-21): newer versions of PDF.js no longer produce this issue.
As of OurMarks 3.0.0 this step is disabled by default, but it's still available behind an option.
It was found that Arabic content is stored as independent text items, one for each character,
so the characters have to be merged back into proper items.

A simple algorithm was created to solve that. Here's an overview:
Note that coordinates in PDF documents are measured from the bottom-left corner.
- Sort the list of items in ascending order, first by their Y coordinates, then by their X coordinates.
- For each range in the list with the same Y coordinates do:
  - Iterate over the row's items in left to right order:
    - Check if the current item should be merged with the previous one:
      - They should match in height.
      - Neither of the items should be protected.
        - An item is considered protected if it's a number of 5 digits (a student id).
      - Define `errorTolerance = currentItem.height / 10`.
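The parts of the overview listed above can be sketched as follows. This is a partial illustration under stated assumptions, not the module's implementation: all names are hypothetical, heights are compared with strict equality, a student id is matched as exactly five ASCII digits, and only the checks listed above are covered. How `errorTolerance` is used is not shown in this section, so the sketch only computes it.

```typescript
// Hypothetical type based on the simplified item table in Step 04.
interface SimpleTextItem {
  value: string;
  arabic: boolean;
  x: number;
  y: number;
  width: number;
  height: number;
}

// Sorts ascending by Y first, then by X, without mutating the input.
function sortItems(items: SimpleTextItem[]): SimpleTextItem[] {
  return [...items].sort((a, b) => a.y - b.y || a.x - b.x);
}

// Splits an already sorted list into rows of items sharing the same Y coordinate.
function groupIntoRows(sorted: SimpleTextItem[]): SimpleTextItem[][] {
  const rows: SimpleTextItem[][] = [];
  for (const item of sorted) {
    const lastRow = rows[rows.length - 1];
    if (lastRow !== undefined && lastRow[0].y === item.y) lastRow.push(item);
    else rows.push([item]);
  }
  return rows;
}

// An item is protected if it is a 5-digit number (a student id).
// Assumption: only ASCII digits are considered.
function isProtected(item: SimpleTextItem): boolean {
  return /^\d{5}$/.test(item.value);
}

// The tolerance defined in the overview; its use is outside this sketch.
function errorTolerance(currentItem: SimpleTextItem): number {
  return currentItem.height / 10;
}

// Covers only the checks listed above: matching height and no protected item.
function passesListedChecks(previous: SimpleTextItem, current: SimpleTextItem): boolean {
  return previous.height === current.height && !isProtected(previous) && !isProtected(current);
}
```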
