pdf去水印实现

更新于：2024-06-09 15:00

一、PDF的相关运算符

1. 概述

想要删除 pdf 文件中的水印，就需要了解一些水印在 pdf 中的一些运算符。下面是需要了解的相关知识点概览：

1. PDF页面由一个或多个内容流组成，由页面对象中的 `/Contents` 条目定义，可以通过读写其内容流修改对应页面的内容
2. 页面结构里的 `/Contents` 就是我们用于解析水印的基础，它包含了页面里所展示的内容
3. 需要关注的一些运算符：
  3-1. 图形状态运算符 (cm q Q)
  3-2. 文本运算符 (Tm Tj TJ)
  3-3. 标记内容运算符 (BDC EMC)
  3-4. 外部对象操作符 (Do)

2. 图形状态运算符 cm

可以使用cm运算符来更改从用户空间坐标到设备空间坐标的转换，这被称为当前转换矩阵 (Current Transformation Matrix, CTM)，对图形状态的这种改变是由q/Q对隔离的。cm运算符有六个参数，表示要与 CTM 组成的矩阵，并且是将给定的变换附加到 CTM，而并非替换它，以下则是基本的变换。

从起始点 (0, 0) 向 (0+dx, 0+dy) 平移可由 1, 0, 0, 1, dx, dy 六个参数指定；
长宽由 (1, 1) 缩放至 (sx, sy) 可由 sx, 0, 0, sy, 0, 0 六个参数指定；
围绕 (0, 0) 逆时针旋转 x 弧度可以由 cos x, sin x, -sin x, cos x, 0, 0 六个参数指定。

下面提供在 pdf 协议码中的样例：

# 逆时针旋转 acos(0.96) 个弧度
q
0.96 0.25 -0.25 0.96 0 0 cm
...
Q
q
# 将原始形状缩放至 0.5
0.5 0 0 0.5 0 0 cm
...
Q

所以很多时候如果cm预算符只符合旋转的特征，可以直接取 cos x 来获取字条角度。

3. 文本运算符 Tm

Tm运算符有六个参数，定义为 [a,b,c,d,e,f] ，将文本矩阵和文本行矩阵设置如下，直接替换当前矩阵，而不是与其连接。

⌈ a    b    0 ⌉
| c    d    0 |
⌊ d    f    1 ⌋

Tm运算符的变换与cm类似，也可以分为平移、缩放、旋转三种(忽略倾斜)，易得三种变化互不干扰，可以轻易还原使用的变换。下面是一个简单的文本旋转水印，用它来计算一下。

([], b'BT')
([22.81844, 22.81844, -22.81844, 22.81844, 707.65, 126.13], b'Tm')
([[b'\n\x00\x15', 138, b'\x00\x15', 138, b'\x00\x15', 138, b'\x00\x15']], b'TJ')
([], b'ET')

取得Tm矩阵为 [22.81844, 22.81844, -22.81844, 22.81844, 707.65, 126.13]，转为如下，观察到矩阵并非单一变换，所以开始尝试分解。
first

先将矩阵中的“位移变换矩阵”分解出来，如下图所示：
second

接着由于“缩放变换矩阵”和“旋转变化矩阵”的 e, f 参数都为 0，可以只算左上角的 2X2 矩阵，即需要分解成旋转和缩放矩阵，而这两个矩阵的乘积如下:
third

快速计算旋转的角度 θ，可以忽略缩放变换矩阵的干扰。
fourth

所以上面的Tm矩阵中的旋转角度tan(θ) = 1，atan(θ) = 0.785398，θ=45°，计算出该字条角度。

4. 标记内容运算符 BDC/EMC

部分软件生成的水印会将字条类型直接标记为水印，例如下面样例所示：

(['/Artifact', {'/Subtype': '/Watermark', '/Type': '/Pagination'}], b'BDC')
...
([], b'EMC')

5. 外部对象操作符 Do

这个符号本来没什么特征，但是很多页面由于水印内容相同，会调用同一个外部对象，这就增大了对应文本的嫌疑。

二、标记内容水印

去除与标记内容相关的水印，用于去除 PDF 源码里如下包含/Subtype/Watermark的水印。

/Artifact <</Subtype/Watermark/Type/Pagination>> BDC
q
...
Q
EMC

三、代码展示

下面代码展示了清除 pdf 文件中的水印内容。

def remove_marked_content_watermark(operations):
    new_operations = []
    i, last_end_index = 0, 0
    while i < len(operations):
        origin_i = i
        start_index, end_index = _search_marked_content_element(operations, i)
        if end_index < 0:
            break
        if start_index < 0:
            new_operations.extend(operations[last_end_index:end_index])
            last_end_index = i = end_index
        if 0 <= start_index < end_index:
            last_end_index = i = end_index + 1
        i = max(i, origin_i + 1)
    new_operations.extend(operations[last_end_index:])
    return new_operations


def remove_rotate_text_watermark(operations):
    new_operations = []
    last_end_index = 0
    min_cos, max_cos = cos(60 / 180 * pi), cos(30 / 180 * pi)
    i = 0
    while i < len(operations):
        operands, operator = operations[i]
        origin_i = i
        if operator == b_("Tm") and min_cos < _get_rotation_cos(operands) < max_cos:
            for j in range(i + 1, len(operations)):
                next_operands, next_operator = operations[j]
                if next_operator in {b_("Tj"), b_("TJ")}:
                    new_operations.extend(operations[last_end_index:j])
                    last_end_index = j + 1
                elif next_operator in {b_("Tm"), b_("ET")}:
                    i = j
                    break
        i = max(i, origin_i + 1)
    new_operations.extend(operations[last_end_index:])
    return new_operations


def remove_do_type_watermark(operations):
    new_operations = []
    last_end_index = 0
    i = 0
    while i < len(operations):
        operands, operator = operations[i]
        origin_i = i
        if operator == b_("Do"):
            start_index, end_index = _search_single_do_element(operations, i)
            if end_index <= start_index:
                start_index, end_index = _search_rotate_do_element(operations, i)
            if end_index > start_index:
                new_operations.extend(operations[last_end_index: start_index])
                last_end_index = end_index
                i = end_index
        i = max(i, origin_i + 1)
    new_operations.extend(operations[last_end_index:])
    return new_operations

完整源码在此处。

四、小结

本篇文档简单讲述一下 pdf 协议码中水印的存在位置以及如何处理，目前除水印这块逻辑仍然需要继续优化，继续处理不常见的水印类型。