author | Robin Watts <robin.watts@artifex.com> | 2018-01-01 17:24:42 +0000 |
---|---|---|
committer | Robin Watts <robin.watts@artifex.com> | 2018-01-05 11:47:08 +0000 |
commit | 25593f4f9df0c4a9b9adaa84aaa33fe2a89087f6 (patch) | |
tree | 207c75e3a1bb4b05e83846762e3cf5fb030a0eed /CHANGES | |
parent | 1202a24a5b2729093545a89d013eaef1557a5fe9 (diff) | |
Fix "being able to search for redacted text" bug.
A customer reports that even after text has been redacted, we can
still search for the redacted text. The example file supplied had
many instances of the word 'words', and 4 instances of 'apple'.
The 'apple' instances were redacted, and the document saved out.
Two such instances were on the first page; when we searched for
'apple', Acrobat would find the word after the first removed
instance of 'apple', then find the word two after the second
removed instance of 'apple'.
After much head scratching and cutting down of the file, it
appears that the information genuinely isn't in the file. Acrobat
is somehow remembering it. It appears to be doing this using the
'ID' entries in the trailer dict.
My suspicion is that Acrobat has cached the text extraction from
the original document, and is using this on all files that match
the IDs. Change the IDs (or remove them) and the problem goes away.
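
To make the suspected mechanism concrete, here is a toy model of a viewer that caches extracted text keyed on the trailer ID pair. It is only an illustration of the hypothesis above, not a description of how Acrobat actually works; the names `extraction_cache` and `extract_text` and the sample IDs are invented for this sketch.

```python
# Toy model only: a hypothetical viewer-side cache keyed on the trailer
# /ID pair. Nothing here is taken from Acrobat or MuPDF.
extraction_cache = {}

def extract_text(file_id, page_text):
    """Return cached text for this ID pair if present, else cache page_text."""
    if file_id not in extraction_cache:
        extraction_cache[file_id] = page_text
    return extraction_cache[file_id]

# The original file is opened first, so its text lands in the cache.
original_id = (b"permanent-id", b"changing-id-1")
extract_text(original_id, "an apple and some words")

# The redacted file keeps both IDs, so the stale text is served.
stale = extract_text(original_id, "an  and some words")

# With a different second ID the lookup misses and the real text is used.
fresh = extract_text((b"permanent-id", b"changing-id-2"), "an  and some words")
```

In this model `stale` still contains 'apple' while `fresh` does not, which matches the observation that changing or removing the IDs makes the problem go away.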
The spec says that the ID should be an array of 2 bytestrings. The
first is supposed to stay the same in all versions of a file (i.e.
it identifies the *original* version of the file, and it is the one
that is used by encryption).
The second bytestring is supposed to change more often, so here we
simply return a new random string each time the file is written.
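
The ID handling described above can be sketched as follows. This is a minimal illustration rather than the code in this commit: `next_trailer_id`, `format_id_array`, and the 16-byte length are assumptions made for the example, as is the choice to generate a random first bytestring when the file has no ID at all.

```python
import secrets

ID_LENGTH = 16  # assumption: 16-byte IDs; the text above does not fix a length

def next_trailer_id(existing_id=None):
    """Return the (first, second) ID pair to write on this save.

    The first bytestring is carried over unchanged so that it keeps
    identifying the original file; the second is replaced by fresh
    random bytes on every call.
    """
    new_changing = secrets.token_bytes(ID_LENGTH)
    if existing_id is None:
        # Assumption: with no prior ID, start with a random first string too.
        return (secrets.token_bytes(ID_LENGTH), new_changing)
    permanent, _old_changing = existing_id
    return (permanent, new_changing)

def format_id_array(file_id):
    """Render the pair as a PDF array of two hex strings."""
    return "[<%s> <%s>]" % (file_id[0].hex().upper(), file_id[1].hex().upper())

first_save = next_trailer_id()
second_save = next_trailer_id(first_save)
print(format_id_array(second_save))
```

Across the two saves the first element is identical and the second differs, so a reader that matches files on the full ID pair would no longer treat the rewritten file as the one it saw before.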
Diffstat (limited to 'CHANGES')
0 files changed, 0 insertions, 0 deletions