summaryrefslogtreecommitdiff
path: root/fpdfsdk/fpdftext_embeddertest.cpp
AgeCommit message (Collapse)Author
2017-11-30Rewrite lower level details of extracting text from pageRyan Harrison
The current implementation of text extraction was difficult to understand, duplicated logic that existed in other methods, and wasn't clear about the units the inputs were in. It also didn't handle control characters correctly. The new implementation leans on the methods for converting indices between the text buffer index and character list index spaces to avoid duplication of code. It also makes it clear to the reader that inputs are in the character list index space. Finally, it fixes issues being seen in Chrome with respect of ranges being slightly off. This CL also adds a test for extracting text that has control characters. BUG=pdfium:942,chromium:654578 Change-Id: Id9d1f360c2d7492c7b5a48d6c9ae29f530892742 Reviewed-on: https://pdfium-review.googlesource.com/20014 Commit-Queue: Ryan Harrison <rharrison@chromium.org> Reviewed-by: dsinclair <dsinclair@chromium.org> Reviewed-by: Henrique Nakashima <hnakashima@chromium.org>
2017-11-29Fix some nits in FPDFText_GetText().Lei Zhang
Use more variables to avoid redundant calculations. Add one more edge test case. Change-Id: I6c8a0aca9de3bdd1a394c39304fd9a75009f9489 Reviewed-on: https://pdfium-review.googlesource.com/19690 Commit-Queue: Ryan Harrison <rharrison@chromium.org> Reviewed-by: Ryan Harrison <rharrison@chromium.org>
2017-11-27Change FPDF_GetText to return "" when asked to get 0 charactersRyan Harrison
BUG=chromium:788103 Change-Id: I8ebdbc78eb14c358d7ac019b96de4828e6071b79 Reviewed-on: https://pdfium-review.googlesource.com/19350 Commit-Queue: Ryan Harrison <rharrison@chromium.org> Reviewed-by: dsinclair <dsinclair@chromium.org>
2017-11-20Add regression tests for issues with correctly removing hyphensRyan Harrison
There was a regression due to a refactor, where the public API was no longer removing soft hyphens for line broken words. This was causing issues with find and copy/paste operations that depend on selecting a region of text. This change is covered by FPDFTextEmbeddertest.GetTextWithHyphen. FPDFTextEmbeddertest.bug_782596 is a regression test for a bug that was introduced by the original fix. It only fails when running the test under ASAN. BUG=pdfium:935 Change-Id: I26096583c35f9246a3662e702f89b742f1146780 Reviewed-on: https://pdfium-review.googlesource.com/18610 Reviewed-by: Lei Zhang <thestig@chromium.org> Reviewed-by: dsinclair <dsinclair@chromium.org> Commit-Queue: Ryan Harrison <rharrison@chromium.org>
2017-10-24Add more tests for FPDFText methods.Lei Zhang
BUG=pdfium:921 Change-Id: I6973359e6ac112c56843f66eb0b70462f42f9cae Reviewed-on: https://pdfium-review.googlesource.com/16630 Reviewed-by: Ryan Harrison <rharrison@chromium.org> Commit-Queue: Lei Zhang <thestig@chromium.org>
2017-09-05Leave space for null characters when getting textRyan Harrison
The conversion from WideString to ByeString adds in null characters at the end, so we need to account for these when selecting the range of text to initially extract. BUG=chromium:761770,chromium:761626 Change-Id: Ib8f863e997ebccaaf882e0beb29733f27a18826d Reviewed-on: https://pdfium-review.googlesource.com/13110 Commit-Queue: Ryan Harrison <rharrison@chromium.org> Reviewed-by: dsinclair <dsinclair@chromium.org>
2017-08-31Make FPDF_GetText stricter on inputsRyan Harrison
The current implementation of this function is problematic. It will attempt to memcpy to NULL. It will accept obviously wrong inputs like a negative start index. It will also accept -1 for the count, which in theory is the amount of space the buffer has allocated to it, so doesn't make sense, but instead an internal call will calculate the number of characters to get if the count is -1. This will them lead to the function attempting to call Left(-1) on a string, which is invalid. Ths documentation for this function mentions none of this behaviour, so I am removing it, since it is inconsistent/bad. The implementation should now more strictly meet defined API. BUG=pdfium:828 Change-Id: I18afdb33e12d77c10d856b4bacd615481979c484 Reviewed-on: https://pdfium-review.googlesource.com/12733 Commit-Queue: Ryan Harrison <rharrison@chromium.org> Reviewed-by: Tom Sepez <tsepez@chromium.org>
2017-08-31Remove fx_basic.hDan Sinclair
This CL removes the fx_basic.h header and fixes up includes as needed. Change-Id: I49af32a8327bdbcda40c50a61ffbd75d06609040 Reviewed-on: https://pdfium-review.googlesource.com/12670 Commit-Queue: dsinclair <dsinclair@chromium.org> Reviewed-by: Tom Sepez <tsepez@chromium.org>
2017-08-12Add a new public method to get the the origin of a character.Andrew Weintraub
Bug: Change-Id: I376f4af26791cd4ed04049ab179c2b39dd262725 Reviewed-on: https://pdfium-review.googlesource.com/10690 Reviewed-by: Lei Zhang <thestig@chromium.org> Commit-Queue: Lei Zhang <thestig@chromium.org>
2017-05-20Better identify web links by trimming irrelevant charschromium/3107Wei Li
Sometimes, web links are written with other text such as punctuations which makes the extracted web links invalid. We improve this by trimming invalid chars at the end of host name only URLs. For example, host names never ends with ';' or ','. BUG=chromium:720578 Change-Id: Id619025b2153531376d268a69a3a89c3d49fce08 Reviewed-on: https://pdfium-review.googlesource.com/5692 Commit-Queue: Wei Li <weili@chromium.org> Reviewed-by: Lei Zhang <thestig@chromium.org>
2017-03-17Handle web links across lineschromium/3045Wei Li
When a web link has a hyphen at the end of line, we consider it to be continued to the next line. For example, "http://www.abc.com/my-\r\ntest" should be extracted as "http://www.abc.com/my-test". BUG=pdfium:650 Change-Id: I64a93d9c66faf2be0abdaf8cfe8ee496c435d0ca Reviewed-on: https://pdfium-review.googlesource.com/3092 Commit-Queue: Wei Li <weili@chromium.org> Reviewed-by: Lei Zhang <thestig@chromium.org>
2016-11-21Fixup lint flags.Dan Sinclair
The -build/include setting was masking out build/include_what_you_use. This CL restores them, fixes any build errors, and adds NOLINT as needed. As well, the runtime/explicit and runtime/printf flags are aslo enabled and NOLINT'd. lint cleanups Change-Id: Ib013b3eb29c8d0e48cad74c5df9028684130719f Reviewed-on: https://pdfium-review.googlesource.com/2030 Reviewed-by: Tom Sepez <tsepez@chromium.org>
2016-09-29Move core/fxcrt/include to core/fxcrtdsinclair
BUG=pdfium:611 Review-Url: https://codereview.chromium.org/2382723003
2016-09-15Use ToUnicode mapping even when unicode is 0.npm
CPDF_Font::UnicodeFromCharcode returns 0 only if ToUnicode map maps the charcode to 0. CPDF_SimpleFont::UnicodeFromCharcode and CPDF_CID_Font:: UnicodeFromCharCode return 0 only if the call to CPDF_Font returns 0. In other cases, these methods return an empty string. So when processing text, a 0 return from the method should not be replaced with the charcode. BUG=pdfium:583 Review-Url: https://codereview.chromium.org/2342073002
2016-06-07Get rid of NULLs in core/thestig
Review-Url: https://codereview.chromium.org/2032613003
2016-03-29Code change to avoid signed/unsigned mismatch warnings.Wei Li
This makes pdfium code on Linux and Mac sign-compare warning free. The warning flag will be re-enabled after checking on windows clang build. BUG=pdfium:29 R=tsepez@chromium.org Review URL: https://codereview.chromium.org/1841643002 .
2016-03-23Move core/include/fxcrt to core/fxcrt/include.Dan Sinclair
This CL moves the fxcrt code into the core/fxcrt directory. The only exception was fx_bidi.h which was moved into core/fxcrt as it is not used outside of core/. R=tsepez@chromium.org Review URL: https://codereview.chromium.org/1825953002 .
2016-03-14Move fpdfsdk/src up to fpdfsdk/.Dan Sinclair
This CL moves the files in fpdfsdk/src/ up one level to fpdfsdk/ and fixes up the include paths, include guards and build files. R=tsepez@chromium.org Review URL: https://codereview.chromium.org/1799773002 .