summaryrefslogtreecommitdiff
path: root/fpdfsdk/fpdftext.cpp
AgeCommit message (Collapse)Author
2018-01-04Convert usages of pdfium::Optional to OptionalRyan Harrison
Change-Id: I29769f78eaad10c6a8b79e27524336c4f330377e Reviewed-on: https://pdfium-review.googlesource.com/22258 Reviewed-by: Tom Sepez <tsepez@chromium.org> Reviewed-by: dsinclair <dsinclair@chromium.org> Commit-Queue: Ryan Harrison <rharrison@chromium.org>
2017-11-30Rewrite lower level details of extracting text from pageRyan Harrison
The current implementation of text extraction was difficult to understand, duplicated logic that existed in other methods, and wasn't clear about the units the inputs were in. It also didn't handle control characters correctly. The new implementation leans on the methods for converting indices between the text buffer index and character list index spaces to avoid duplication of code. It also makes it clear to the reader that inputs are in the character list index space. Finally, it fixes issues being seen in Chrome with respect of ranges being slightly off. This CL also adds a test for extracting text that has control characters. BUG=pdfium:942,chromium:654578 Change-Id: Id9d1f360c2d7492c7b5a48d6c9ae29f530892742 Reviewed-on: https://pdfium-review.googlesource.com/20014 Commit-Queue: Ryan Harrison <rharrison@chromium.org> Reviewed-by: dsinclair <dsinclair@chromium.org> Reviewed-by: Henrique Nakashima <hnakashima@chromium.org>
2017-11-29Fix some nits in FPDFText_GetText().Lei Zhang
Use more variables to avoid redundant calculations. Add one more edge test case. Change-Id: I6c8a0aca9de3bdd1a394c39304fd9a75009f9489 Reviewed-on: https://pdfium-review.googlesource.com/19690 Commit-Queue: Ryan Harrison <rharrison@chromium.org> Reviewed-by: Ryan Harrison <rharrison@chromium.org>
2017-11-28Convert from character stream to visible text indices in GetTextRyan Harrison
Most of the API methods FPDFText operate on indices in terms of the underlying stream of characters. This stream includes non-printing control characters, which are not part of the visible text. The majority of files do not appear to have these hidden characters so there is a 1:1 correspondence between them. When they are present conversion needs to occur to make sure that GetText doesn't attempt to retrieve for a span that is out of range. BUG=chromium:788103,chromium:788220 Change-Id: I4c9fa403ea65754ba94e3f15ded49fe0641e9db5 Reviewed-on: https://pdfium-review.googlesource.com/19550 Reviewed-by: dsinclair <dsinclair@chromium.org> Commit-Queue: dsinclair <dsinclair@chromium.org>
2017-11-27Change FPDF_GetText to return "" when asked to get 0 charactersRyan Harrison
BUG=chromium:788103 Change-Id: I8ebdbc78eb14c358d7ac019b96de4828e6071b79 Reviewed-on: https://pdfium-review.googlesource.com/19350 Commit-Queue: Ryan Harrison <rharrison@chromium.org> Reviewed-by: dsinclair <dsinclair@chromium.org>
2017-09-27Remove FX_STRSIZE and replace with size_tRyan Harrison
BUG=pdfium:828 Change-Id: I5c40237433ebabaeabdb43aec9cdf783e41dfe16 Reviewed-on: https://pdfium-review.googlesource.com/13230 Reviewed-by: dsinclair <dsinclair@chromium.org> Commit-Queue: Ryan Harrison <rharrison@chromium.org>
2017-09-18Convert string class namesRyan Harrison
Automated using git grep & sed. Replace StringC classes with StringView classes. Remove the CFX_ prefix and put string classes in fxcrt namespace. Change AsStringC() to AsStringView(). Rename tests from TEST(fxcrt, *String*Foo) to TEST(*String*, Foo). Couple of tests needed to have their names regularlized. BUG=pdfium:894 Change-Id: I7ca038685c8d803795f3ed02545124f7a224c83d Reviewed-on: https://pdfium-review.googlesource.com/14151 Reviewed-by: Tom Sepez <tsepez@chromium.org> Commit-Queue: Ryan Harrison <rharrison@chromium.org>
2017-09-12Remove ASSERT that was added to understand what was occuringRyan Harrison
thestig provided a test PDF that reproduced the issue that is being tested for in the ASSERT. I have spent some time going throught the related code, and the condition in the assert is actually standard behaviour, so shouldn't be asserted. The following code gracefully handles the case of more text then requested being returned. BUG=chromium:763369 Change-Id: I5bc121977169deead52fc5dd2503376b1b62d83f Reviewed-on: https://pdfium-review.googlesource.com/13750 Reviewed-by: Lei Zhang <thestig@chromium.org> Commit-Queue: Ryan Harrison <rharrison@chromium.org>
2017-09-11Add guard against reading more then expected from the pageRyan Harrison
This really shouldn't ever happen, but there used to be this guard in this code and I am getting reports of crashes after it was removed. I have added an assert, so hopefully if it is actually occuring, then we might get a reproduction case based on a debug build crash. BUG=chromium:763369 Change-Id: Ifaebfbcb0413a1d7777222ba838aaee234f94ae3 Reviewed-on: https://pdfium-review.googlesource.com/13691 Reviewed-by: Tom Sepez <tsepez@chromium.org> Commit-Queue: Ryan Harrison <rharrison@chromium.org>
2017-09-05Leave space for null characters when getting textRyan Harrison
The conversion from WideString to ByeString adds in null characters at the end, so we need to account for these when selecting the range of text to initially extract. BUG=chromium:761770,chromium:761626 Change-Id: Ib8f863e997ebccaaf882e0beb29733f27a18826d Reviewed-on: https://pdfium-review.googlesource.com/13110 Commit-Queue: Ryan Harrison <rharrison@chromium.org> Reviewed-by: dsinclair <dsinclair@chromium.org>
2017-08-31Make FPDF_GetText stricter on inputsRyan Harrison
The current implementation of this function is problematic. It will attempt to memcpy to NULL. It will accept obviously wrong inputs like a negative start index. It will also accept -1 for the count, which in theory is the amount of space the buffer has allocated to it, so doesn't make sense, but instead an internal call will calculate the number of characters to get if the count is -1. This will them lead to the function attempting to call Left(-1) on a string, which is invalid. Ths documentation for this function mentions none of this behaviour, so I am removing it, since it is inconsistent/bad. The implementation should now more strictly meet defined API. BUG=pdfium:828 Change-Id: I18afdb33e12d77c10d856b4bacd615481979c484 Reviewed-on: https://pdfium-review.googlesource.com/12733 Commit-Queue: Ryan Harrison <rharrison@chromium.org> Reviewed-by: Tom Sepez <tsepez@chromium.org>
2017-08-28Convert find markers to Optionals in CPDF_TextPageFindRyan Harrison
Currently these use -1 as a special value to indicate not set. This creates the same issues that FX_STRNPOS created for converting FX_STRSIZE to size_t, so this code has been rewritten. BUG=pdfium:828 Change-Id: Iaaa96af0dcb2eb8b600f3ea39060a398ac9a3800 Reviewed-on: https://pdfium-review.googlesource.com/12130 Reviewed-by: Tom Sepez <tsepez@chromium.org> Commit-Queue: Ryan Harrison <rharrison@chromium.org>
2017-08-12Add a new public method to get the the origin of a character.Andrew Weintraub
Bug: Change-Id: I376f4af26791cd4ed04049ab179c2b39dd262725 Reviewed-on: https://pdfium-review.googlesource.com/10690 Reviewed-by: Lei Zhang <thestig@chromium.org> Commit-Queue: Lei Zhang <thestig@chromium.org>
2017-08-10Rename DLLEXPORT AND STDCALLDan Sinclair
This CL renames DLLEXPORT to FPDF_EXPORT and STDCALL to FPDF_CALLCONV to be more PDFium specific. This is split off of https://pdfium-review.googlesource.com/c/8970 by Felix Kauselmann. Bug: pdfium:825 Change-Id: I0aea9d43f1714b1e10e935c4a7eea685a5ad8998 Reviewed-on: https://pdfium-review.googlesource.com/10610 Reviewed-by: Tom Sepez <tsepez@chromium.org> Commit-Queue: dsinclair <dsinclair@chromium.org>
2017-08-10Revert "Add a build target and a proper export header for shared library ↵Henrique Nakashima
builds." This reverts commit 00334675c18a0203f313cceb670c970a77280f49. Reason for revert: Breaking the deps roller - https://chromium-review.googlesource.com/c/609307 Original change's description: > Add a build target and a proper export header for shared library builds. > > This CL adds support for Chromium's component build feature to pdfium. > > The export header stub in fpdfview.h is expanded to match Chromium's > export mechanisms and patterns (fixes pdfium:825). > > A component/shared library build can be triggered by adding > > is_component_build = true > > as a gn argument. Please note that setting this will also affect some > of pdfiums dependencies like v8, which will be build as components > too. > > Additionally, this CL provides a "pdf_source_set" template which > dynamically enables the use of "source_set" when building a complete > static library or a shared library to reduce build time. > > When testing this it is recommended to only build the pdfium target as > most of pdfiums test rely on non-public functions which aren't exported > by the shared library. > > Bug: pdfium:825,pdfium:826 > Change-Id: Icedc538ec535e11d1e53c4d5fabc8c064b275752 > Reviewed-on: https://pdfium-review.googlesource.com/8970 > Reviewed-by: dsinclair <dsinclair@chromium.org> > Commit-Queue: dsinclair <dsinclair@chromium.org> TBR=thestig@chromium.org,tsepez@chromium.org,brucedawson@chromium.org,dsinclair@google.com,dsinclair@chromium.org,licorn@gmail.com Change-Id: Ib02af2298932481293f50d362ae87bfedf284821 No-Presubmit: true No-Tree-Checks: true No-Try: true Bug: pdfium:825, pdfium:826 Reviewed-on: https://pdfium-review.googlesource.com/10550 Reviewed-by: Henrique Nakashima <hnakashima@chromium.org> Commit-Queue: Henrique Nakashima <hnakashima@chromium.org>
2017-08-09Add a build target and a proper export header for shared library builds.Felix Kauselmann
This CL adds support for Chromium's component build feature to pdfium. The export header stub in fpdfview.h is expanded to match Chromium's export mechanisms and patterns (fixes pdfium:825). A component/shared library build can be triggered by adding is_component_build = true as a gn argument. Please note that setting this will also affect some of pdfiums dependencies like v8, which will be build as components too. Additionally, this CL provides a "pdf_source_set" template which dynamically enables the use of "source_set" when building a complete static library or a shared library to reduce build time. When testing this it is recommended to only build the pdfium target as most of pdfiums test rely on non-public functions which aren't exported by the shared library. Bug: pdfium:825,pdfium:826 Change-Id: Icedc538ec535e11d1e53c4d5fabc8c064b275752 Reviewed-on: https://pdfium-review.googlesource.com/8970 Reviewed-by: dsinclair <dsinclair@chromium.org> Commit-Queue: dsinclair <dsinclair@chromium.org>
2017-05-22Convert more c-style pointers to CFX_UnownedPtrTom Sepez
Change-Id: I551b4210c95db0b916e9fe6cddf11e6c3d015c50 Reviewed-on: https://pdfium-review.googlesource.com/5790 Reviewed-by: Lei Zhang <thestig@chromium.org> Commit-Queue: Tom Sepez <tsepez@chromium.org>
2017-04-03Drop FXSYS_ from mem methodsDan Sinclair
This Cl drops the FXSYS_ from mem methods which are the same on all platforms. Bug: pdfium:694 Change-Id: I9d5ae905997dbaaec5aa0b2ae4c07358ed9c6236 Reviewed-on: https://pdfium-review.googlesource.com/3613 Reviewed-by: Tom Sepez <tsepez@chromium.org> Commit-Queue: dsinclair <dsinclair@chromium.org>
2017-03-14Replace FX_FLOAT with underlying float type.Dan Sinclair
Change-Id: I158b7d80b0ec28b742a9f2d5a96f3dde7fb3ab56 Reviewed-on: https://pdfium-review.googlesource.com/3031 Commit-Queue: dsinclair <dsinclair@chromium.org> Reviewed-by: Tom Sepez <tsepez@chromium.org> Reviewed-by: Nicolás Peña <npm@chromium.org>
2017-02-21Remove non CFX_PointF GetIndexAtPosDan Sinclair
This Cl removes the overload that takes a x,y position and uses the CFX_PointF version in all cases. Change-Id: Iebd048d91a1e1ed1c90ec0019e19750afe06e745 Reviewed-on: https://pdfium-review.googlesource.com/2812 Commit-Queue: dsinclair <dsinclair@chromium.org> Reviewed-by: Tom Sepez <tsepez@chromium.org> Reviewed-by: Nicolás Peña <npm@chromium.org>
2016-11-21Fixup lint flags.Dan Sinclair
The -build/include setting was masking out build/include_what_you_use. This CL restores them, fixes any build errors, and adds NOLINT as needed. As well, the runtime/explicit and runtime/printf flags are aslo enabled and NOLINT'd. lint cleanups Change-Id: Ib013b3eb29c8d0e48cad74c5df9028684130719f Reviewed-on: https://pdfium-review.googlesource.com/2030 Reviewed-by: Tom Sepez <tsepez@chromium.org>
2016-11-02Remove FX_BOOL from fpdfsdk.tsepez
Review-Url: https://codereview.chromium.org/2453683011
2016-11-02Rename CPDFXFA_Document to CPDFXFA_Contextdsinclair
The CPDFXFA_Document class isn't a document, it contains documents. Renamed to make the purpose a bit clearer. Review-Url: https://codereview.chromium.org/2469813004
2016-10-04Rename fpdfsdk/fpdfxfa files to match contentschromium/2881dsinclair
Each of these files contains a single class, rename the file to match the internal class name. Review-Url: https://codereview.chromium.org/2385423004
2016-10-04Move core/fpdfapi/fpdf_page to core/fpdfapi/pagedsinclair
BUG=pdfium:603 Review-Url: https://codereview.chromium.org/2386423004
2016-09-29Move fpdfsdk/include to fpdfsdkdsinclair
BUG=pdfium:611 Review-Url: https://codereview.chromium.org/2384503003
2016-09-29Move fpdfsdk/fpdfxfa/include to fpdfsdk/fpdfxfadsinclair
BUG=pdfium:611 Review-Url: https://codereview.chromium.org/2381993002
2016-09-29Move core/fpdftext/include to core/fpdftextdsinclair
BUG=pdfium:611 Review-Url: https://codereview.chromium.org/2383563002
2016-09-29Move core/fpdfdoc/include to core/fpdfdocdsinclair
BUG=pdfium:611 Review-Url: https://codereview.chromium.org/2374383003
2016-09-29Move core/fpdfapi/fpdf_page/include to core/fpdfapi/fpdf_pagedsinclair
BUG=pdfium:611 Review-Url: https://codereview.chromium.org/2379033002
2016-07-28Split fpdfdoc/include/fpdf_doc.h into individual classes.dsinclair
This CL splits the header file apart. The cpp files are not touched as part of this CL, they will be done as a followup. This de-duplicates the fpdf_doc.h BUG=pdfium:249 Review-Url: https://codereview.chromium.org/2183313004
2016-05-04More define cleanup.dsinclair
This CL converts defines into constants, enums, enum classes or removes them as needed. Review-Url: https://codereview.chromium.org/1938163002
2016-04-21Replace CFX_RectArray with std::vector<CFX_FloatRect>tsepez
Use RVO now that we use an array type compatible with it. Review URL: https://codereview.chromium.org/1906903002
2016-04-21Remove CFX_ArrayTemplate from CPDF_LinkExtracttsepez
Use unqiue_ptrs while we're at it, also better ctor. Review URL: https://codereview.chromium.org/1896303002
2016-04-19Remove IPDF_TextPage, IPDF_TextPageFind and IPDF_LinkExtract interfaces.dsinclair
Each was only used by one subclass. Removed and used the concrete classes. BUG=pdfium:468 Review URL: https://codereview.chromium.org/1897993002
2016-04-06Move remaining fpdfdoc filesdsinclair
This CL moves the last two fpdfdoc files from core/include/fpdfdoc to core/fpdfdoc/include. Review URL: https://codereview.chromium.org/1864163002
2016-04-06Move code from fpdfsdk/include to the sub directories.dsinclair
This CL moves the fxedit, jsapi and fpdfxfa code out of fpdfsdk/include to the various sub-include directories. Review URL: https://codereview.chromium.org/1863163002
2016-03-21Re-enable several MSVC warningsWei Li
Re-enable the following warnings: 4245: signed/unsigned conversion mismatch; 4310: cast may truncate data; 4389: operator on signed/unsigned mismatch; 4701: use potentially uninitialized local variable; 4706: assignment within conditional expression Clean up the code to avoid those warnings. BUG=pdfium:29 R=tsepez@chromium.org Review URL: https://codereview.chromium.org/1801383002 .
2016-03-16Move core/include/fpdfapi/fpdf_page.h to correct locations.Dan Sinclair
This CL splits the fpdf_page.h header into the individual classes and moves them to the correct core/fpdfapi locations. R=tsepez@chromium.org Review URL: https://codereview.chromium.org/1805663002 .
2016-03-14Move fx_crypto.h and fpdf_text.h out of core/include.Dan Sinclair
This CL moves the two files and breaks fpdf_text.h apart into individual pieces. R=tsepez@chromium.org Review URL: https://codereview.chromium.org/1801973002 .
2016-03-14Move fpdfsdk/src up to fpdfsdk/.Dan Sinclair
This CL moves the files in fpdfsdk/src/ up one level to fpdfsdk/ and fixes up the include paths, include guards and build files. R=tsepez@chromium.org Review URL: https://codereview.chromium.org/1799773002 .