Python script to archive large pdf into rar with 20x compression rate
Problem
Storing large PDFs (over 2GB) wastes server space and causes problems when downloading with Google Chrome.
Solution
The most efficient way to save server space is to compress the PDF to RAR.
RAR can reduce the size of a PDF by a factor of ten to nearly twenty, depending on the file, which made it the best of the three formats compared here (comparison chart: zip, 7z, rar).
On a Linux server, this can be done with a watchdog script written in Python 3 that uses the pyinotify and patool packages.
General idea of the script
- When a pdf file appears and is ready to work with, archiving into rar will start.
Pay attention to the phrase "ready to work with": a file that is still being written to disk is not ready. You need to wait until the file has been completely written to disk, and only then work with it (otherwise a broken file will be archived, which then cannot be read).
- After the rar archive has been created successfully, a text file will be created. It serves as a marker, signaling to any external system that the archive is ready.
For example, the external system might be Oracle, and you want to write the RAR file into a database field.
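On the consuming side, the marker check might look like the following minimal sketch. The function name `archive_is_ready` is hypothetical, and the marker naming (archive name plus `.txt`) is assumed to follow the convention this script uses.

```python
import os
import tempfile


def archive_is_ready(archive_path):
    """Return True only if both the archive and its .txt marker exist.

    Hypothetical helper for an external consumer; it assumes the marker
    is named <archive>.txt.
    """
    return os.path.isfile(archive_path) and os.path.isfile(archive_path + '.txt')


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, 'report.pdf.rar')
    with open(archive, 'wb') as fh:
        fh.write(b'placeholder, not a real RAR archive')
    before_marker = archive_is_ready(archive)   # marker not written yet

    with open(archive + '.txt', 'w') as fh:
        fh.write('OK!')
    after_marker = archive_is_ready(archive)    # marker is now present

print(before_marker, after_marker)
```

An external job would call such a check before loading the archive, so that it never picks up an archive that is still being written.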
Here it is important to track the moment when the file is completely written and ready for further actions.
For example, it may turn out that the file has not yet been fully copied to the directory.
Linux reports such state changes through file system events (inotify).
We need the following events:
- IN_CREATE
- IN_CLOSE_WRITE
- IN_MOVED_TO
- IN_MOVED_FROM
- IN_DELETE
- IN_DELETE_SELF
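As a rough guide to how these events relate to the "ready to work with" question, the sketch below classifies them by name. It is plain Python with no pyinotify dependency; `READY_EVENTS`, `LOG_ONLY_EVENTS` and `first_ready_event` are illustrative names, not part of any library. Note that the script in this article starts archiving only on IN_CLOSE_WRITE.

```python
# Events after which the file content can be assumed complete:
#   IN_CLOSE_WRITE - a file that was open for writing has been closed
#   IN_MOVED_TO    - a finished file was renamed/moved into the directory
READY_EVENTS = {'IN_CLOSE_WRITE', 'IN_MOVED_TO'}

# Events that are only worth logging: the file may still be incomplete
# (IN_CREATE) or has already left the directory (move-away and deletes).
LOG_ONLY_EVENTS = {'IN_CREATE', 'IN_MOVED_FROM', 'IN_DELETE', 'IN_DELETE_SELF'}


def first_ready_event(event_names):
    """Return the first event in the sequence that signals a complete file."""
    for name in event_names:
        if name in READY_EVENTS:
            return name
    return None


# A file copied into the directory is created first and closed last.
copy_sequence = ['IN_CREATE', 'IN_CLOSE_WRITE']
print(first_ready_event(copy_sequence))
```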
Requirements
Pyinotify is a Python module for monitoring filesystem changes. It relies on a Linux kernel feature called inotify (merged in kernel 2.6.13). inotify is an event-driven notifier; its notifications are exported from kernel space to user space through three system calls. Pyinotify binds these system calls and provides an implementation on top of them, offering a generic and abstract way to use that functionality.
Follow the official documentation to install pyinotify.
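Patool is a library for creating, extracting, and testing archives in many formats, including RAR. It delegates the actual work to external archive programs, so creating RAR archives also requires the rar command-line tool to be installed on the server.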
How to install patool is described here.
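Before running the watchdog, it can help to verify that everything is in place. The following is a minimal sketch using only the standard library; `missing_requirements` is a hypothetical helper, and the assumption that the external program is called `rar` on your system should be checked against your installation.

```python
import importlib.util
import shutil


def missing_requirements(modules=('pyinotify', 'patoolib'), programs=('rar',)):
    """Return the names of Python modules and programs that cannot be found."""
    # find_spec returns None when a top-level module is not installed
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which returns None when the executable is not on PATH
    missing += [p for p in programs if shutil.which(p) is None]
    return missing


print(missing_requirements() or 'all requirements found')
```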
Import Libraries and define variables
Let's create the pdf_watchdog.py file, where we will write our Python code.
First of all, we need to define script encoding, import libraries and define some variables.
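# -*- coding: utf-8 -*-
import os
import pyinotify
import patoolib
from datetime import datetime

# not used below: FileWatcher builds its own, narrower event mask
flags = pyinotify.ALL_EVENTS
# watched directory; keep the trailing slash, paths are built by concatenation
# (note that this name shadows the built-in dir())
dir = 'pdf/'
# log file name, created inside the watched directory
log_file = 'log_watcher.log'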
Next, our watchdog should only watch files with specific extensions.
Filter files by extension
Let's create two helper functions. The suffix_filter function filters out events for files whose extension is not listed in the SUFFIXES set. The write_log function appends a timestamped line to the log file.
SUFFIXES = {'.pdf', '.txt', '.rar'}

def suffix_filter(event):
    # return True to stop processing of event (to "stop chaining")
    return os.path.splitext(event.name)[1] not in SUFFIXES

def write_log(log_str):
    date_str = datetime.now().strftime('%Y.%m.%d %H:%M:%S') + ': '
    res_str = date_str + log_str
    # append to the log file; the with-block closes it even on error
    with open(dir + log_file, 'a+') as f1:
        f1.write(res_str + '\r\n')

class EventProcessor(pyinotify.ProcessEvent):
    ...
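To see what the filter does without starting a watcher, the sketch below feeds it a few stand-in events. `SimpleNamespace` is used here only as a substitute for a pyinotify event object, on the assumption that the filter reads nothing but the `name` attribute.

```python
import os
from types import SimpleNamespace

SUFFIXES = {'.pdf', '.txt', '.rar'}


def suffix_filter(event):
    # True means "skip this event", False means "process it"
    return os.path.splitext(event.name)[1] not in SUFFIXES


# SimpleNamespace stands in for a pyinotify event; only .name is needed.
for name in ('report.pdf', 'report.pdf.rar', 'notes.docx', 'log_watcher.log'):
    event = SimpleNamespace(name=name)
    print(name, '-> skipped' if suffix_filter(event) else '-> processed')
```

Two details follow from this: the log file itself is skipped because `.log` is not in the set, and the comparison is case sensitive, so a file named `REPORT.PDF` would be ignored as well.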
EventProcessor
Our EventProcessor class inherits from pyinotify.ProcessEvent and processes incoming events from the filesystem.
class EventProcessor(pyinotify.ProcessEvent):
    def __init__(self, callback):
        self.event_callback = callback

    def __call__(self, event):
        # dispatch only events whose file extension is in SUFFIXES
        if not suffix_filter(event):
            super(EventProcessor, self).__call__(event)

    def process_IN_CREATE(self, event):
        write_log('in CREATE: ' + event.pathname)

    def process_IN_DELETE(self, event):
        write_log('in DELETE: ' + event.pathname)

    def process_IN_DELETE_SELF(self, event):
        write_log('in DELETE_SELF: ' + event.pathname)

    def process_IN_MOVED_FROM(self, event):
        write_log('in MOVED_FROM: ' + event.pathname)

    def process_IN_MOVED_TO(self, event):
        write_log('in MOVED_TO: ' + event.pathname)

    def process_IN_CLOSE_WRITE(self, event):
        write_log('in CLOSE_WRITE: ' + event.pathname)
        ext = os.path.splitext(event.name)[1]
        if ext == '.pdf':
            archive = dir + event.name + '.rar'
            # if a RAR-file or its TXT-marker already exists, delete it
            for stale in (archive, archive + '.txt'):
                if os.path.exists(stale):
                    os.remove(stale)
            # creating rar archive
            patoolib.create_archive(archive, (dir + event.name,))
        elif ext == '.rar':
            write_log('OK. RAR Archive created: ' + event.pathname)
            # creating txt-marker
            with open(dir + event.name + '.txt', 'a') as f:
                f.write('OK!')
            # deleting source pdf (the archive path without the '.rar' suffix)
            source = event.pathname[0:-4]
            if os.path.exists(source):
                os.remove(source)
            else:
                write_log('The file doesnt exist: ' + source)
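The handler relies on a simple naming chain: the archive is the PDF name plus `.rar`, the marker is the archive name plus `.txt`, and the source PDF is recovered by cutting the last four characters off the archive path. The sketch below spells this out with three hypothetical helper functions that are not part of the script itself.

```python
def archive_path(pdf_path):
    """Archive name: the PDF name with '.rar' appended."""
    return pdf_path + '.rar'


def marker_path(rar_path):
    """Marker name: the archive name with '.txt' appended."""
    return rar_path + '.txt'


def source_path(rar_path):
    """Source PDF: the archive path without its trailing '.rar'."""
    return rar_path[:-len('.rar')]


pdf = 'pdf/report.pdf'
rar = archive_path(pdf)
print(rar)               # name of the archive to create
print(marker_path(rar))  # name of the marker written after archiving
print(source_path(rar))  # name of the PDF to delete afterwards
```

Because the marker ends in `.txt` and the archive in `.rar`, both pass the suffix filter, which is why those extensions are listed in SUFFIXES next to `.pdf`.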
File watcher
The file watcher monitors the files in the directory and passes the events that occur there to the EventProcessor.
class FileWatcher:
    notifier = None

    def start_watch(self, dir, callback):
        wm = pyinotify.WatchManager()
        self.notifier = pyinotify.Notifier(wm, EventProcessor(callback))
        mask = (pyinotify.IN_CREATE
                | pyinotify.IN_MODIFY
                | pyinotify.IN_DELETE
                | pyinotify.IN_DELETE_SELF
                | pyinotify.IN_MOVED_FROM
                | pyinotify.IN_MOVED_TO
                | pyinotify.IN_CLOSE_WRITE)
        # rec=True also watches the subdirectories that already exist
        wdd = wm.add_watch(dir, mask, rec=True)
        write_log('Watchdog running...')
        # handle queued events, then wait for new ones and read them
        while True:
            self.notifier.process_events()
            if self.notifier.check_events():
                self.notifier.read_events()
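The mask is an ordinary bit field: each event type occupies one bit, `|` combines them, and `&` tests whether a given event is included. The sketch below illustrates this with the numeric values from the Linux header `<sys/inotify.h>`, written out by hand so that it runs without pyinotify; `is_watched` is a hypothetical helper.

```python
# Event bits as defined in the Linux header <sys/inotify.h>
IN_MODIFY = 0x002
IN_CLOSE_WRITE = 0x008
IN_OPEN = 0x020          # not part of the mask below
IN_MOVED_FROM = 0x040
IN_MOVED_TO = 0x080
IN_CREATE = 0x100
IN_DELETE = 0x200
IN_DELETE_SELF = 0x400

# Same combination of events as in start_watch()
mask = (IN_CREATE | IN_MODIFY | IN_DELETE | IN_DELETE_SELF
        | IN_MOVED_FROM | IN_MOVED_TO | IN_CLOSE_WRITE)


def is_watched(event_bit, watch_mask=mask):
    """Return True if the event bit is set in the mask."""
    return bool(event_bit & watch_mask)


print(hex(mask))
print(is_watched(IN_CLOSE_WRITE), is_watched(IN_OPEN))
```

Leaving out bits such as IN_OPEN keeps the kernel from reporting events that the script would only discard.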
Start Watch
The last step in creating our script is to run the start_watch() method:
f = FileWatcher()
f.start_watch(dir, None)
Video demo of the pdf to rar compression
© Alexander Khudoev. 2022