Python script to archive large pdf into rar with 20x compression rate
Table of contents
Problem
Storing large PDFs (over 2GB) wastes server space and causes problems when downloading with Google Chrome.
Solution
The most efficient way to save server space is to compress PDF to RAR.
RAR compression provides nearly 20x PDF compression and is currently the best compression algorithm for pdf
and can reduce file size tenfold
(comparison chart - zip, 7z, rar)
On a Linux server, this can be done by creating a watchdog script in Python 3 and the patool package.
General idea of the script
- When the 
pdffile appears and ready to work, archiving inrarwill start. 
Pay attention to the phrase "the file is ready to work": a file that is still in the process of being written
to disk cannot be called "ready to work". You need to wait, the file will be completely written to the disk, and only then you can work with it (otherwise the broken file will be archived, which then cannot be read).
- After the successful creation of the 
rar archive, atext filewill be created, which will be a kind ofmarker - signaling to any 
external systemthat the archive issuccessfully ready. 
For example, if the external system is Oracle, and you want to write the RAR file into database field.
Here it is important to track the moment when the file is completely ready and formed for further actions.
For example, it may turn out that the file is not yet fully copied to the directory.
To do this, Linux has several file-specific events.
We need the following Linux file system events:
- IN_CREATE
 - CLOSE_WRITE
 - MOVED_TO
 - MOVED_FROM
 - IN_DELETE
 - IN_DELETE_SELF
 
Requirements
Pyinotify is a Python module for monitoring filesystems changes. Pyinotify relies on a Linux Kernel feature (merged in kernel 2.6.13) called inotify. inotify is an event-driven notifier, its notifications are exported from kernel space to user space through three system calls. pyinotify binds these system calls and provides an implementation on top of them offering a generic and abstract way to manipulate those functionalities.
Follow the official documentation to install pyinotify.
Patool is a library for creating, extracting, testing archives, including in the RAR format.
How to install patool is described here.
Import Libraries and define variables
Let's create the pdf_watchdog.py file, where we will write our Python code.
First of all, we need to define script encoding, import libraries and define some variables.
# -*- coding: utf-8 -*-
import os
import pyinotify
import patoolib
from datetime import datetime
flags = pyinotify.ALL_EVENTS
dir = 'pdf/'
log_file = 'log_watcher.log'Next, our watchdog should only watch files with specific extensions.
Filter files by extension
Let's create
suffix_filtermethod will filter files from a set of extensions that appear in the directory are defined in theSUFFIXESarray.write_logmethod will write the log-file.
SUFFIXES = {'.pdf', '.txt', '.rar'}
def suffix_filter(event):
    # return True to stop processing of event (to "stop chaining")
    return os.path.splitext(event.name)[1] not in SUFFIXES
def write_log(log_str):
    date_str = str(datetime.now().strftime('%Y.%m.%d %H:%M:%S')) + ': '
    res_str = date_str + log_str
    f1 = open(dir + log_file, 'a+')
    f1.write(res_str + '\r\n')
    f1.close()
class EventProcessor(pyinotify.ProcessEvent):
...EventProcessor
So, our EventProcessor class will take the pyinotify.ProcessEvent as a parameter and will process incoming events
from the filesystem.
class EventProcessor(pyinotify.ProcessEvent):
    def __init__(self, callback):
        self.event_callback = callback
    def __call__(self, event):
        if not suffix_filter(event):
            super(EventProcessor, self).__call__(event)
    def process_IN_CREATE(self, event):
        write_log('in CREATE: ' + event.pathname))
    def process_IN_DELETE(self, event):
        write_log('in DELETE: ' + event.pathname))
    def process_IN_DELETE_SELF(self, event):
        write_log('in DELETE_SELF: ' + event.pathname))
    def process_IN_MOVED_FROM(self, event):
        write_log('in MOVED_FROM: ' + event.pathname))
    def process_IN_MOVED_TO(self, event):
        write_log('in MOVED_TO: ' + event.pathname))
    def process_IN_CLOSE_WRITE(self, event):
        write_log('in CLOSE_WRITE: ' + event.pathname))
        if os.path.splitext(event.name)[1] == '.pdf':
            # if RAR-file already exists, delete it
            if os.path.exists(dir + event.name + '.rar'):
                os.remove(dir + event.name + '.rar')
                # if TXT-marker exists, delete it too
                if os.path.exists(dir + event.name + '.rar.txt'):
                    os.remove(dir + event.name + '.rar.txt')
                # creating rar archive
                patoolib.create_archive(dir + event.name
                  + '.rar', (dir + event.name,));
            else:
                # creating rar archive
                patoolib.create_archive(dir + event.name
                  + '.rar', (dir + event.name,));
        if os.path.splitext(event.name)[1] == '.rar':
            if not os.path.exists(dir + event.name + '.rar'):
                write_log('OK. RAR Archive created: ' + event.pathname))
                # creating txt-marker
                f = open(dir + event.name + '.txt', 'a')
                f.write('OK!')
                f.close()
                # deleting source pdf
                if os.path.exists(event.pathname[0:-4]):
                    os.remove(event.pathname[0:-4])
                else:
                    write_log('The file doesnt exist: ' + event.pathname))
File watcher
The file watcher will monitor the files in the directory, and the events that occur with it will then be passed
to the EventProcessor input.
class FileWatcher:
    notifier = None
    def start_watch(self, dir, callback):
        wm = pyinotify.WatchManager()
        self.notifier = pyinotify.Notifier(wm, EventProcessor(callback))
        mask = (pyinotify.IN_CREATE
                | pyinotify.IN_MODIFY
                | pyinotify.IN_DELETE
                | pyinotify.IN_DELETE_SELF
                | pyinotify.IN_MOVED_FROM
                | pyinotify.IN_MOVED_TO
                | pyinotify.IN_CLOSE_WRITE)
        wdd = wm.add_watch(dir, mask, rec=True)
        write_log('Watchdog running...'))
        while True:
            self.notifier.process_events()
            if self.notifier.check_events():
                self.notifier.read_events()Start Watch
The last step in creating our script is to run the start_watch() method
f = FileWatcher()
f.start_watch(dir, None)Video demo of the pdf to rar compression
© Alexander Khudoev. 2022