author | Ryan Harrison <rharrison@chromium.org> | 2017-11-30 21:02:41 +0000 |
---|---|---|
committer | Chromium commit bot <commit-bot@chromium.org> | 2017-11-30 21:02:41 +0000 |
commit | 8b357e7504ea804293983453540ae91c9fc57922 (patch) | |
tree | 7b8f611eac73034f9149b014fb547d6886e0d5b7 /testing | |
parent | 0ae8e03cc2d310ba0ba19b878ea448f17a577cdb (diff) | |
download | pdfium-8b357e7504ea804293983453540ae91c9fc57922.tar.xz |
Rewrite lower level details of extracting text from page
The current implementation of text extraction was difficult to
understand, duplicated logic that existed in other methods, and wasn't
clear about the units the inputs were in. It also didn't handle
control characters correctly.
The new implementation leans on the methods for converting indices
between the text buffer index and character list index spaces to avoid
duplication of code. It also makes it clear to the reader that inputs
are in the character list index space. Finally, it fixes issues seen
in Chrome where ranges were slightly off.
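The index conversion described above can be illustrated with a minimal sketch. This is not PDFium's code: the names `CharEntry`, `MakeCharList`, `CharToTextIndex`, `TextToCharIndex` and `ExtractText` are hypothetical, and the sketch assumes that control characters (bytes below 0x20) stay in the character list but are left out of the text buffer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical model of one entry in the character list.
struct CharEntry {
  char ch;
  bool in_text_buffer;  // Assumption: false for control characters.
};

// Builds a character list, marking bytes below 0x20 as absent from the buffer.
std::vector<CharEntry> MakeCharList(const std::string& raw) {
  std::vector<CharEntry> chars;
  for (char c : raw)
    chars.push_back({c, static_cast<unsigned char>(c) >= 0x20});
  return chars;
}

// Concatenates only the characters that are part of the text buffer.
std::string BuildTextBuffer(const std::vector<CharEntry>& chars) {
  std::string out;
  for (const CharEntry& c : chars)
    if (c.in_text_buffer) out.push_back(c.ch);
  return out;
}

// Character list index -> text buffer index, or nullopt if the character is
// out of range or not present in the text buffer.
std::optional<std::size_t> CharToTextIndex(const std::vector<CharEntry>& chars,
                                           std::size_t char_index) {
  if (char_index >= chars.size() || !chars[char_index].in_text_buffer)
    return std::nullopt;
  std::size_t text_index = 0;
  for (std::size_t i = 0; i < char_index; ++i)
    if (chars[i].in_text_buffer) ++text_index;
  return text_index;
}

// Text buffer index -> character list index, or nullopt if out of range.
std::optional<std::size_t> TextToCharIndex(const std::vector<CharEntry>& chars,
                                           std::size_t text_index) {
  std::size_t seen = 0;
  for (std::size_t i = 0; i < chars.size(); ++i) {
    if (!chars[i].in_text_buffer) continue;
    if (seen == text_index) return i;
    ++seen;
  }
  return std::nullopt;
}

// `start` and `count` are in the character list index space. The range is
// narrowed to the buffered characters it contains, then converted once.
std::string ExtractText(const std::vector<CharEntry>& chars,
                        std::size_t start, std::size_t count) {
  if (start >= chars.size() || count == 0) return {};
  const std::size_t end = start + std::min(count, chars.size() - start);
  std::size_t first = start;
  while (first < end && !chars[first].in_text_buffer) ++first;
  if (first == end) return {};  // Range holds only skipped characters.
  std::size_t last = end - 1;
  while (!chars[last].in_text_buffer) --last;  // Stops at `first` at the latest.
  assert(first <= last);
  const std::size_t text_first = *CharToTextIndex(chars, first);
  const std::size_t text_last = *CharToTextIndex(chars, last);
  return BuildTextBuffer(chars).substr(text_first, text_last - text_first + 1);
}
```

Routing every range through the two conversion helpers keeps the skipping rule in one place, which is the kind of de-duplication the message describes.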
This CL also adds a test for extracting text that has control
characters.
BUG=pdfium:942,chromium:654578
Change-Id: Id9d1f360c2d7492c7b5a48d6c9ae29f530892742
Reviewed-on: https://pdfium-review.googlesource.com/20014
Commit-Queue: Ryan Harrison <rharrison@chromium.org>
Reviewed-by: dsinclair <dsinclair@chromium.org>
Reviewed-by: Henrique Nakashima <hnakashima@chromium.org>
Diffstat (limited to 'testing')
-rw-r--r-- | testing/resources/control_characters.in | 54 |
-rw-r--r-- | testing/resources/control_characters.pdf | 64 |
2 files changed, 118 insertions, 0 deletions
diff --git a/testing/resources/control_characters.in b/testing/resources/control_characters.in
new file mode 100644
index 0000000000..ca7827fe11
--- /dev/null
+++ b/testing/resources/control_characters.in
@@ -0,0 +1,54 @@
+{{header}}
+{{object 1 0}} <<
+  /Type /Catalog
+  /Pages 2 0 R
+>>
+endobj
+{{object 2 0}} <<
+  /Type /Pages
+  /MediaBox [ 0 0 200 200 ]
+  /Count 1
+  /Kids [ 3 0 R ]
+>>
+endobj
+{{object 3 0}} <<
+  /Type /Page
+  /Parent 2 0 R
+  /Resources <<
+    /Font <<
+      /F1 4 0 R
+      /F2 5 0 R
+    >>
+  >>
+  /Contents 6 0 R
+>>
+endobj
+{{object 4 0}} <<
+  /Type /Font
+  /Subtype /Type1
+  /BaseFont /Times-Roman
+>>
+endobj
+{{object 5 0}} <<
+  /Type /Font
+  /Subtype /Type1
+  /BaseFont /Helvetica
+>>
+endobj
+{{object 6 0}} <<
+>>
+stream
+BT
+20 50 Td
+/F1 12 Tf
+(Hello\2\3, world!) Tj
+0 50 Td
+/F2 16 Tf
+(Goodbye, world!) Tj
+ET
+endstream
+endobj
+{{xref}}
+{{trailer}}
+{{startxref}}
+%%EOF
diff --git a/testing/resources/control_characters.pdf b/testing/resources/control_characters.pdf
new file mode 100644
index 0000000000..535009733f
--- /dev/null
+++ b/testing/resources/control_characters.pdf
@@ -0,0 +1,64 @@
+%PDF-1.7
+% ò¤ô
+1 0 obj <<
+  /Type /Catalog
+  /Pages 2 0 R
+>>
+endobj
+2 0 obj <<
+  /Type /Pages
+  /MediaBox [ 0 0 200 200 ]
+  /Count 1
+  /Kids [ 3 0 R ]
+>>
+endobj
+3 0 obj <<
+  /Type /Page
+  /Parent 2 0 R
+  /Resources <<
+    /Font <<
+      /F1 4 0 R
+      /F2 5 0 R
+    >>
+  >>
+  /Contents 6 0 R
+>>
+endobj
+4 0 obj <<
+  /Type /Font
+  /Subtype /Type1
+  /BaseFont /Times-Roman
+>>
+endobj
+5 0 obj <<
+  /Type /Font
+  /Subtype /Type1
+  /BaseFont /Helvetica
+>>
+endobj
+6 0 obj <<
+>>
+stream
+BT
+20 50 Td
+/F1 12 Tf
+(Hello\2\3, world!) Tj
+0 50 Td
+/F2 16 Tf
+(Goodbye, world!) Tj
+ET
+endstream
+endobj
+xref
+0 7
+0000000000 65535 f
+0000000015 00000 n
+0000000068 00000 n
+0000000161 00000 n
+0000000303 00000 n
+0000000381 00000 n
+0000000457 00000 n
+trailer<< /Root 1 0 R /Size 7 >>
+startxref
+582
+%%EOF
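The control characters in the test file come from the `\2` and `\3` sequences inside `(Hello\2\3, world!)`. In PDF literal string syntax a backslash followed by one to three octal digits denotes a single byte, so these are bytes 0x02 and 0x03. The sketch below shows that decoding step in isolation; `DecodePdfLiteralString` is a hypothetical helper written for illustration, not PDFium's parser, and it expects the string body without the enclosing parentheses.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Decodes the escape sequences of a PDF literal string body (hypothetical
// helper; the caller strips the outer parentheses first).
std::string DecodePdfLiteralString(std::string_view body) {
  std::string out;
  for (std::size_t i = 0; i < body.size(); ++i) {
    char c = body[i];
    if (c != '\\') {
      out.push_back(c);
      continue;
    }
    if (++i == body.size()) break;  // Trailing backslash: nothing to escape.
    c = body[i];
    if (c >= '0' && c <= '7') {
      // Octal escape: one to three digits, high-order overflow ignored.
      int value = 0;
      std::size_t digits = 0;
      while (digits < 3 && i < body.size() && body[i] >= '0' &&
             body[i] <= '7') {
        value = value * 8 + (body[i] - '0');
        ++i;
        ++digits;
      }
      --i;  // The for loop's increment steps past the last digit.
      out.push_back(static_cast<char>(value & 0xFF));
      continue;
    }
    switch (c) {
      case 'n': out.push_back('\n'); break;
      case 'r': out.push_back('\r'); break;
      case 't': out.push_back('\t'); break;
      case 'b': out.push_back('\b'); break;
      case 'f': out.push_back('\f'); break;
      case '\n': break;  // Backslash + newline is a line continuation.
      case '\r':         // Same for CR or CRLF.
        if (i + 1 < body.size() && body[i + 1] == '\n') ++i;
        break;
      default:
        // Covers \( \) \\ and unknown escapes, where the backslash is dropped.
        out.push_back(c);
        break;
    }
  }
  return out;
}
```

Applied to the first `Tj` operand in the content stream, this yields a 15-byte string with 0x02 and 0x03 between "Hello" and ", world!".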