summaryrefslogtreecommitdiff
path: root/core/fpdftext/cpdf_textpage.cpp
AgeCommit message (Collapse)Author
2018-06-08Use FPDFText_GetBoundedText() to get the visible text in a test.Lei Zhang
Add a test PDF with multiple pages, each with a different media box and crop box. Demonstrate how FPDFText_GetText() gets all the text on the page, and how FPDFText_GetBoundedText() with the right bounding boxes gets only the visible text on the page. Also fix a small nit in CPDF_TextPage::GetTextByRect() found while writing this CL. BUG=pdfium:387 Change-Id: I9ce4bb181e2ba5b454ea1341bbccef9ba94c9cd8 Reviewed-on: https://pdfium-review.googlesource.com/34550 Commit-Queue: Ryan Harrison <rharrison@chromium.org> Reviewed-by: Ryan Harrison <rharrison@chromium.org>
2018-05-15Return earlier in CPDF_TextPage::ProcessInsertObject() when possible.Lei Zhang
Change-Id: Ibc446d9f4606d29f997ee3521f31f0fbcf1ad84b Reviewed-on: https://pdfium-review.googlesource.com/32176 Commit-Queue: Lei Zhang <thestig@chromium.org> Reviewed-by: Ryan Harrison <rharrison@chromium.org>
2018-05-15Simplify various functions in CPDF_TextPage.Lei Zhang
Break out parts of ProcessTextObject() into helper functions. Change-Id: I76ad141728ce77e8d24d7dc9635722d670663f39 Reviewed-on: https://pdfium-review.googlesource.com/32175 Reviewed-by: Ryan Harrison <rharrison@chromium.org> Commit-Queue: Lei Zhang <thestig@chromium.org>
2018-05-08Simplify more code in CPDF_TextObject.Lei Zhang
- Move code into a GenerateSpace() function. - Break apart some font size conversions. Change-Id: I4d5ea112fc004a31ac38b7c19ff77fcbfe764d38 Reviewed-on: https://pdfium-review.googlesource.com/32157 Commit-Queue: Ryan Harrison <rharrison@chromium.org> Reviewed-by: Ryan Harrison <rharrison@chromium.org>
2018-05-07Move some CPDF_TextPage methods into an anonymous namespace.Lei Zhang
Change-Id: I959d687d7d46fa61e1fe097b0b876ad02d2b123c Reviewed-on: https://pdfium-review.googlesource.com/32153 Commit-Queue: Lei Zhang <thestig@chromium.org> Reviewed-by: Ryan Harrison <rharrison@chromium.org>
2018-05-07Initialize CPDF_TextPage members in the header.Lei Zhang
Change-Id: I667a3cd696d44692fa3d73bdee7c2f48d3039255 Reviewed-on: https://pdfium-review.googlesource.com/32152 Commit-Queue: Ryan Harrison <rharrison@chromium.org> Reviewed-by: Ryan Harrison <rharrison@chromium.org>
2018-05-01Clean up CFX_BidiString.Lei Zhang
- Refer to the string in CFX_BidiString by const-ref. - Remove useless CharAt() method. - Turn a member variable into a local variable. Change-Id: I30f221b7350150c839a793129789d8ea7cc1f331 Reviewed-on: https://pdfium-review.googlesource.com/31670 Reviewed-by: dsinclair <dsinclair@chromium.org> Commit-Queue: Lei Zhang <thestig@chromium.org>
2018-04-30Reorganize Unicode_GetNormalization() some more.Lei Zhang
Change-Id: I183a53d08f5da73d788c92b53382e3fac3b823e2 Reviewed-on: https://pdfium-review.googlesource.com/31671 Commit-Queue: Lei Zhang <thestig@chromium.org> Reviewed-by: Ryan Harrison <rharrison@chromium.org>
2018-04-30Add CPDF_TextPage::GetPrevCharInfo() helper method.Lei Zhang
Change-Id: Ie5bea82757682390b274ad2da77d1686cc597046 Reviewed-on: https://pdfium-review.googlesource.com/31657 Reviewed-by: Ryan Harrison <rharrison@chromium.org> Commit-Queue: Lei Zhang <thestig@chromium.org>
2018-04-30Simplify Unicode_GetNormalization() and caller.Lei Zhang
Change-Id: I9a5acb59790fd8527ced745370bdfe35e4d21c36 Reviewed-on: https://pdfium-review.googlesource.com/31656 Commit-Queue: Lei Zhang <thestig@chromium.org> Reviewed-by: Ryan Harrison <rharrison@chromium.org>
2018-04-30Fix some nits in CPDF_TextPage.Lei Zhang
Change-Id: Ib0b1d014af31493c73a74d81c1f3454a203da949 Reviewed-on: https://pdfium-review.googlesource.com/31655 Commit-Queue: Ryan Harrison <rharrison@chromium.org> Reviewed-by: Ryan Harrison <rharrison@chromium.org>
2018-04-12Change GetDisplayMatrix methods to take FX_RECT.Lei Zhang
Change-Id: I079bc3bf1242fd28fdd51930d9deb6efa34d7509 Reviewed-on: https://pdfium-review.googlesource.com/30055 Reviewed-by: dsinclair <dsinclair@chromium.org> Commit-Queue: Lei Zhang <thestig@chromium.org>
2018-04-09Use ByteStringView / pdfium::span in CPDF font as appropriate.chromium/3393Tom Sepez
Change-Id: I92c7ba605bf95a9023ad046b8dddebe0a0592802 Reviewed-on: https://pdfium-review.googlesource.com/29992 Reviewed-by: dsinclair <dsinclair@chromium.org> Commit-Queue: Tom Sepez <tsepez@chromium.org>
2018-03-21Use more const pointers in CPDF_ContentMarkItem.Lei Zhang
Transitively mark the same pointers as const in callers. Change-Id: I1f9669b35c6d7f4b1a11c25163480bc687fbc7f8 Reviewed-on: https://pdfium-review.googlesource.com/28870 Reviewed-by: Henrique Nakashima <hnakashima@chromium.org> Commit-Queue: Lei Zhang <thestig@chromium.org>
2018-03-20Simplify code in CPDF_TextPage.Lei Zhang
- Augment/reuse existing NormalizeThreshold() function. - Use std::min() / std::max(). Change-Id: I709e246c58bf8a69638e1d77dcc2a79ae8a27e77 Reviewed-on: https://pdfium-review.googlesource.com/28736 Commit-Queue: Lei Zhang <thestig@chromium.org> Reviewed-by: Henrique Nakashima <hnakashima@chromium.org>
2018-03-19Remove non-const ref parameters in CPDF_TextPage.Lei Zhang
Change-Id: Ia6cb19269ef93440669b68d76c3c378a5d4da7a5 Reviewed-on: https://pdfium-review.googlesource.com/28735 Reviewed-by: Henrique Nakashima <hnakashima@chromium.org> Commit-Queue: Lei Zhang <thestig@chromium.org>
2018-03-19Remove redundant code in CPDF_TextPage::CountRects().Lei Zhang
CountRects() calls GetRectArray(), which performs the same calculation. Change-Id: I79dcd8e82f6d0fe7ed992da06237f31d0761a902 Reviewed-on: https://pdfium-review.googlesource.com/28734 Commit-Queue: Lei Zhang <thestig@chromium.org> Reviewed-by: Henrique Nakashima <hnakashima@chromium.org>
2018-03-19Avoid crashing in FPDFText_CountRects() for invalid start values.Lei Zhang
BUG=chromium:821305 Change-Id: I371572f60ea3984ce044e25125d882b3c2d03115 Reviewed-on: https://pdfium-review.googlesource.com/28733 Commit-Queue: Lei Zhang <thestig@chromium.org> Reviewed-by: Henrique Nakashima <hnakashima@chromium.org>
2018-03-19Avoid crashing in FPDFText_CountRects() for negative count values.Lei Zhang
Treat values less than -1 as -1. BUG=chromium:821305 Change-Id: Ieaced045473fa51097400e5af1286f0d3f4d0143 Reviewed-on: https://pdfium-review.googlesource.com/28732 Reviewed-by: Henrique Nakashima <hnakashima@chromium.org> Commit-Queue: Lei Zhang <thestig@chromium.org>
2018-03-12Remove all usages of FXSYS_iswASCIIalphaRyan Harrison
Instances are either replaced with FXSYS_iswalpha, which calls out to the ICU library to do the proper Unicode operations, or have been converted to a isascii && isalpha pair, if ASCII alpha is actually what was wanted. BUG=pdfium:1035 Change-Id: I971ff639ee1ff818ad08793a1900a8bcbb0a3e04 Reviewed-on: https://pdfium-review.googlesource.com/28450 Reviewed-by: dsinclair <dsinclair@chromium.org> Commit-Queue: Ryan Harrison <rharrison@chromium.org>
2018-03-12Remove all usages of FXSYS_iswASCIIalnumRyan Harrison
Instances are either replaced with FXSYS_iswalnum, which calls out to the ICU library to do the proper Unicode operations, or have been converted to a isascii && isalnum pair, if ASCII alnum is actually what was wanted. BUG=pdfium:1035 Change-Id: I959ec8739a4d020e61562180393ab8113a81577c Reviewed-on: https://pdfium-review.googlesource.com/28430 Reviewed-by: dsinclair <dsinclair@chromium.org> Commit-Queue: Ryan Harrison <rharrison@chromium.org>
2018-03-09Explicitly mark helper methods that only operate on ASCII rangesRyan Harrison
A number of our character helper methods take in wide character types, but only do tests/operations on the ASCII range of characters. As a very quick first pass I am renaming all of the foot-gun methods to explictly call out this behaviour, while I do a bigger cleanup/refactor. BUG=pdfium:1035 Change-Id: Ia035dfa1cb6812fa6d45155c4565475032c4c165 Reviewed-on: https://pdfium-review.googlesource.com/28330 Commit-Queue: Ryan Harrison <rharrison@chromium.org> Commit-Queue: dsinclair <dsinclair@chromium.org> Reviewed-by: Henrique Nakashima <hnakashima@chromium.org> Reviewed-by: dsinclair <dsinclair@chromium.org>
2018-02-16Correct mapping text to characters for characters missing from fontRyan Harrison
When parsing text streams there is an internal character list that is generated of all the characters in the stream. Additionally a text string is generated that is exposed via the public API. This string will have all of the printing, i.e. non-control characters, in it. For characters that are not in the font of the stream the unicode, but printable, the character 0xFFFE is used in the text to indicate a missing character. This a non-printing character to indicate non-unicode. The internal character list gets a Unicode value 0x0 when there isn't a glyph in the font for it and the original character code is preserved. This means that when generating the mapping between text string and character list, the code is mistakenly thinking that the unprintable character was not present in the text string. I have changed the check in the mapping generation code to correctly account for this. Additional investigation is needed to determine if inserting 0xFFFE in the text is the correct behaviour. This patch resolves an issue where the find highlights in Chrome for a PDF would be offset when there are unprintable characters in a stream. BUG=pdfium:1010 Change-Id: I7547c46c5645e039a4b5138f2ce1137fa31990a5 Reviewed-on: https://pdfium-review.googlesource.com/27051 Reviewed-by: Henrique Nakashima <hnakashima@chromium.org> Commit-Queue: Ryan Harrison <rharrison@chromium.org>
2018-01-30Use unsigned for char widthchromium/3335Nicolas Pena
Bug: 806612 Change-Id: I22bd9046dd37a1b596762c46a6b29a323d6e9fa1 Reviewed-on: https://pdfium-review.googlesource.com/24410 Reviewed-by: dsinclair <dsinclair@chromium.org> Commit-Queue: Nicolás Peña Moreno <npm@chromium.org>
2018-01-29Fix typo introduced in cleanup of IsHyphenRyan Harrison
Introduced here https://pdfium-review.googlesource.com/#/c/17950/5/core/fpdftext/cpdf_textpage.cpp@1237 BUG=chromium:805881 Change-Id: I0c9109f3eebec968360734ff4d9d0542881d6823 Reviewed-on: https://pdfium-review.googlesource.com/24210 Commit-Queue: dsinclair <dsinclair@chromium.org> Reviewed-by: Lei Zhang <thestig@chromium.org> Reviewed-by: dsinclair <dsinclair@chromium.org>
2018-01-11Change FPDFText_GetRect() to return a boolean.Lei Zhang
BUG=pdfium:858 Change-Id: Idc9900fe6f85b1fef06c97f5023653f77156d410 Reviewed-on: https://pdfium-review.googlesource.com/22730 Commit-Queue: dsinclair <dsinclair@chromium.org> Reviewed-by: dsinclair <dsinclair@chromium.org>
2017-12-01Get rid of else after break/continue/return.chromium/3284chromium/3283Lei Zhang
Change-Id: I3efc57cd7325d16e3ca8ebdeeaec06012b2c56e3 Reviewed-on: https://pdfium-review.googlesource.com/20110 Reviewed-by: Henrique Nakashima <hnakashima@chromium.org> Commit-Queue: Lei Zhang <thestig@chromium.org>
2017-11-30Rewrite lower level details of extracting text from pageRyan Harrison
The current implementation of text extraction was difficult to understand, duplicated logic that existed in other methods, and wasn't clear about the units the inputs were in. It also didn't handle control characters correctly. The new implementation leans on the methods for converting indices between the text buffer index and character list index spaces to avoid duplication of code. It also makes it clear to the reader that inputs are in the character list index space. Finally, it fixes issues being seen in Chrome with respect of ranges being slightly off. This CL also adds a test for extracting text that has control characters. BUG=pdfium:942,chromium:654578 Change-Id: Id9d1f360c2d7492c7b5a48d6c9ae29f530892742 Reviewed-on: https://pdfium-review.googlesource.com/20014 Commit-Queue: Ryan Harrison <rharrison@chromium.org> Reviewed-by: dsinclair <dsinclair@chromium.org> Reviewed-by: Henrique Nakashima <hnakashima@chromium.org>
2017-11-27Convert CPDF_TextObject to not use CollectionSizeDan Sinclair
This CL updates various methods in CPDF_TextObject to return or received size_t values. Callers have been updated as needed. Bug: pdfium:774 Change-Id: Id72511bc74637c6261add39f5414c9a4b8390b82 Reviewed-on: https://pdfium-review.googlesource.com/19430 Commit-Queue: dsinclair <dsinclair@chromium.org> Reviewed-by: Ryan Harrison <rharrison@chromium.org>
2017-11-22Change CPDF_ContentMark to return size_t for counts.Lei Zhang
Change-Id: I45468fa7944290fbbe3d2e67f884164ae8d84160 Reviewed-on: https://pdfium-review.googlesource.com/19171 Reviewed-by: Tom Sepez <tsepez@chromium.org> Commit-Queue: Lei Zhang <thestig@chromium.org>
2017-11-08Make sure that iterator stays in bounds during IsHypenRyan Harrison
BUG=chromium:782596,chromium:781804 Change-Id: I020be3cf813221bb8314f045d83014a25cb9a950 Reviewed-on: https://pdfium-review.googlesource.com/18070 Reviewed-by: dsinclair <dsinclair@chromium.org> Commit-Queue: dsinclair <dsinclair@chromium.org>
2017-11-07Revert cleanup of IsHyphen and reimplementRyan Harrison
The original version of this code was landed in https://pdfium-review.googlesource.com/c/pdfium/+/13690/. A corner case has been found that breaks. In this CL I have reverted the changes in IsHyphen and implemented a less aggressive cleanup that I have tested works as expected. BUG=chromium:781804 Change-Id: I3b36f420834081fdd9e1ae17efc234b561b4df41 Reviewed-on: https://pdfium-review.googlesource.com/17950 Commit-Queue: Ryan Harrison <rharrison@chromium.org> Reviewed-by: dsinclair <dsinclair@chromium.org>
2017-11-02Encapsulate CPDF_FormObject members.Lei Zhang
Also remove a conditional in the CPDF_Form ctor that cannot be true. Change-Id: Icd00233969cea33e9c63d0d6a9d07226c2b173f2 Reviewed-on: https://pdfium-review.googlesource.com/17070 Reviewed-by: dsinclair <dsinclair@chromium.org> Commit-Queue: Lei Zhang <thestig@chromium.org>
2017-10-23Fix cpdf_textpage so it doesn't omit spaces.Andrew Weintraub
Bug: pdfium:921 Change-Id: I8864fd2ebdccc5f94aaf70cd8295068bf4db8b68 Reviewed-on: https://pdfium-review.googlesource.com/16492 Reviewed-by: Lei Zhang <thestig@chromium.org> Reviewed-by: Ryan Harrison <rharrison@chromium.org> Commit-Queue: Ryan Harrison <rharrison@chromium.org>
2017-09-27Remove FX_STRSIZE and replace with size_tRyan Harrison
BUG=pdfium:828 Change-Id: I5c40237433ebabaeabdb43aec9cdf783e41dfe16 Reviewed-on: https://pdfium-review.googlesource.com/13230 Reviewed-by: dsinclair <dsinclair@chromium.org> Commit-Queue: Ryan Harrison <rharrison@chromium.org>
2017-09-18Convert string class namesRyan Harrison
Automated using git grep & sed. Replace StringC classes with StringView classes. Remove the CFX_ prefix and put string classes in fxcrt namespace. Change AsStringC() to AsStringView(). Rename tests from TEST(fxcrt, *String*Foo) to TEST(*String*, Foo). Couple of tests needed to have their names regularlized. BUG=pdfium:894 Change-Id: I7ca038685c8d803795f3ed02545124f7a224c83d Reviewed-on: https://pdfium-review.googlesource.com/14151 Reviewed-by: Tom Sepez <tsepez@chromium.org> Commit-Queue: Ryan Harrison <rharrison@chromium.org>
2017-09-13Rewrite IsHyphen using string operationsRyan Harrison
The existing code did end of range checks by making sure that the value was never less then 0. This isn't correct when using an unsigned type, since 0 - 1 will wrap around to the max possible value, and thus still be less then 0. Additionally the existing code was hard to follow due to the complexity of some of the low level operations being performed. It has been rewritten using higher level string operations to make it clearer and correct. BUG=chromium:763256 Change-Id: Ib8bf5ca0e29e73724c4a1c4781362e8a8fc30149 Reviewed-on: https://pdfium-review.googlesource.com/13690 Commit-Queue: Ryan Harrison <rharrison@chromium.org> Reviewed-by: Tom Sepez <tsepez@chromium.org>
2017-09-05Split fx_ucd.h into fx_unicode.h and fx_ucddata.h.Tom Sepez
File naming now matches. Fix one usage not going through the accessor function. Change-Id: I5cc4986238764964f2a71807a94bd2facf517263 Reviewed-on: https://pdfium-review.googlesource.com/12930 Commit-Queue: dsinclair <dsinclair@chromium.org> Reviewed-by: dsinclair <dsinclair@chromium.org>
2017-08-30Convert int* references to FX_STRSIZERyan Harrison
Through out the code base there are numerous places where variables are declared using a signed integer type when interacting with the string classes, since they assume that FX_STRSIZE is 'int'. As part of changing the underling type of FX_STRSIZE to be unsigned, these locations are being changed to use FX_STRSIZE. This is necessary as part of converting the type, but has been broken off into a separate CL, since it should be low risk. Some related cleanups that are low risk are included as part of this CL. BUG=pdfium:828 Change-Id: Ifaae54ad195ccde0fe8672f71271d29a6ebd65fd Reviewed-on: https://pdfium-review.googlesource.com/12210 Reviewed-by: Tom Sepez <tsepez@chromium.org> Reviewed-by: Henrique Nakashima <hnakashima@chromium.org> Reviewed-by: dsinclair <dsinclair@chromium.org> Commit-Queue: Ryan Harrison <rharrison@chromium.org>
2017-08-22Converted CFX_Matrix::TransformRect() to take in constsJane Liu
Currently, all three of CFX_Matrix::TransformRect() take in rect values and modify them in place. This CL converts them to take in constant values and return the transformed values instead, and fixes all the call sites. Bug=pdfium:874 Change-Id: I9c274df3b14e9d88c100ba0530068e06e8fec32b Reviewed-on: https://pdfium-review.googlesource.com/11550 Reviewed-by: dsinclair <dsinclair@chromium.org> Commit-Queue: Jane Liu <janeliulwq@google.com>
2017-08-15Remove GetAt from string classesRyan Harrison
This method duplicates the behaviour of the const [] operator and doesn't offer any additional safety. Folding them into one implementation. SetAt is retained, since implementing the non-const [] operator to replace SetAt has potential performance concerns. Specifically many non-obvious cases of reading an element using [] will cause a realloc & copy. BUG=pdfium:860 Change-Id: I3ef5e5e5a15376f040256b646eb0d90636e24b67 Reviewed-on: https://pdfium-review.googlesource.com/10870 Commit-Queue: Ryan Harrison <rharrison@chromium.org> Reviewed-by: Tom Sepez <tsepez@chromium.org>
2017-08-11Add checks of index operations on string classesRyan Harrison
Specifically the index parameter passed in to GetAt(), SetAt() and operator[] are now being tested to be in bounds. BUG=chromium:752480, pdfium:828 Change-Id: I9e94d58c98a8eaaaae53cd0e3ffe2123ea17d8c4 Reviewed-on: https://pdfium-review.googlesource.com/10651 Commit-Queue: Ryan Harrison <rharrison@chromium.org> Reviewed-by: Tom Sepez <tsepez@chromium.org>
2017-07-28Convert calls to Mid() to Left() or Right() if possibleRyan Harrison
The various string/byte classes support Mid(), Left(), and Right() for extracting substrings. Mid() can handle all possible cases, but Left() and Right() are useful for common cases and more explicit about what is going on. Calls like Mid(offset, length - offset) can be converted to Right(length - offset). Calls like Mid(0, length) can be converted to Left(length). If the substring being extracted does not extend all the way to one of the edges of the string, then Mid() still needs to be used. BUG=pdfium:828 Change-Id: I2ec46ad3d71aac0f7b513e103c69cbe8c854cf62 Reviewed-on: https://pdfium-review.googlesource.com/9510 Commit-Queue: Ryan Harrison <rharrison@chromium.org> Reviewed-by: Tom Sepez <tsepez@chromium.org>
2017-07-27Simplify FX_GetMirrorChar() code.Lei Zhang
Change-Id: I43ec0d4a3b60d51c59ba5a540dfe24803e725089 Reviewed-on: https://pdfium-review.googlesource.com/9170 Reviewed-by: Nicolás Peña <npm@chromium.org> Commit-Queue: Lei Zhang <thestig@chromium.org>
2017-06-29Change SetReverse to GetInverse in CFX_MatrixNicolas Pena
CFX_Matrix::GetInverse is much clearer. Change-Id: Id10ab1723735332e1a78de853f28415ec3a4d834 Reviewed-on: https://pdfium-review.googlesource.com/7090 Reviewed-by: Lei Zhang <thestig@chromium.org> Commit-Queue: Nicolás Peña <npm@chromium.org>
2017-06-02Remove explicit CFX_Matrix identity matrix instantiations.Lei Zhang
Change-Id: I96c3429dbe2c572ed409706adfe3707b8b9bf51b Reviewed-on: https://pdfium-review.googlesource.com/6176 Reviewed-by: Tom Sepez <tsepez@chromium.org> Commit-Queue: Lei Zhang <thestig@chromium.org>
2017-05-25Mass conversion of remaining class members (non-xfa)Tom Sepez
Change-Id: I8365ba80e3395d59a3cf35dbd9d9162e86e712e3 Reviewed-on: https://pdfium-review.googlesource.com/5970 Commit-Queue: Tom Sepez <tsepez@chromium.org> Reviewed-by: Lei Zhang <thestig@chromium.org>
2017-05-24Convert to CFX_UnownedPtr, part 5Tom Sepez
Change-Id: Ibdb20fca7e4daae9d61286df4801ac02faf3b281 Reviewed-on: https://pdfium-review.googlesource.com/5831 Commit-Queue: Lei Zhang <thestig@chromium.org> Reviewed-by: Lei Zhang <thestig@chromium.org>
2017-05-10Replace operator bool with HasRef() in classes with a CFX_SharedCopyOnWrite ↵Lei Zhang
member. Change-Id: I51e30d298e87b9ae0d5aca83b2f1d6787efce70a Reviewed-on: https://pdfium-review.googlesource.com/5290 Commit-Queue: Lei Zhang <thestig@chromium.org> Reviewed-by: Tom Sepez <tsepez@chromium.org> Reviewed-by: Nicolás Peña <npm@chromium.org>
2017-04-25Use fx_extension.h utilities in more places.Lei Zhang
Change-Id: Iba1aa793567e69acc3cc1acbd5b9a9f531c80b7a Reviewed-on: https://pdfium-review.googlesource.com/4453 Reviewed-by: Tom Sepez <tsepez@chromium.org> Commit-Queue: Lei Zhang <thestig@chromium.org>