From 886f932aeeb4c0ed3bb6ccb6ba4da45f9fd29a6f Mon Sep 17 00:00:00 2001
From: Ryan Harrison
Date: Fri, 16 Feb 2018 20:02:50 +0000
Subject: Correct mapping text to characters for characters missing from font

When parsing text streams, an internal character list is generated
that holds all of the characters in the stream. Additionally, a text
string is generated that is exposed via the public API. This string
contains all of the printing, i.e. non-control, characters.

For characters that are printable but are not in the font of the
stream, the character 0xFFFE is used in the text to indicate a missing
character. This is a non-printing character that indicates
non-Unicode.

The internal character list gets a Unicode value of 0x0 when there
isn't a glyph in the font for a character, and the original character
code is preserved. This means that when generating the mapping between
the text string and the character list, the code mistakenly concluded
that the unprintable character was not present in the text string. I
have changed the check in the mapping generation code to correctly
account for this.

Additional investigation is needed to determine whether inserting
0xFFFE in the text is the correct behaviour.

This patch resolves an issue where the find highlights in Chrome for a
PDF would be offset when there are unprintable characters in a stream.

BUG=pdfium:1010

Change-Id: I7547c46c5645e039a4b5138f2ce1137fa31990a5
Reviewed-on: https://pdfium-review.googlesource.com/27051
Reviewed-by: Henrique Nakashima
Commit-Queue: Ryan Harrison
---
 core/fpdftext/cpdf_textpage.cpp | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/core/fpdftext/cpdf_textpage.cpp b/core/fpdftext/cpdf_textpage.cpp
index 16214269ae..e712549ceb 100644
--- a/core/fpdftext/cpdf_textpage.cpp
+++ b/core/fpdftext/cpdf_textpage.cpp
@@ -181,7 +181,8 @@ void CPDF_TextPage::ParseTextPage() {
     int indexSize = pdfium::CollectionSize<int>(m_CharIndex);
     const PAGECHAR_INFO& charinfo = m_CharList[i];
     if (charinfo.m_Flag == FPDFTEXT_CHAR_GENERATED ||
-        (charinfo.m_Unicode != 0 && !IsControlChar(charinfo))) {
+        (charinfo.m_Unicode != 0 && !IsControlChar(charinfo)) ||
+        (charinfo.m_Unicode == 0 && charinfo.m_CharCode != 0)) {
       if (indexSize % 2) {
         m_CharIndex.push_back(1);
       } else {
-- 
cgit v1.2.3
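
The effect of the changed condition on the text-to-character mapping can be illustrated with a minimal, self-contained sketch. This is not PDFium's code: `CharInfo`, `AppearsInTextOld`, `AppearsInTextNew` and `MapTextToCharList` are hypothetical names, the struct is a simplified stand-in for `PAGECHAR_INFO`, and the mapping is kept as a flat list rather than in whatever form `m_CharIndex` actually uses. Only the two boolean conditions are taken from the diff above.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for PDFium's PAGECHAR_INFO.
struct CharInfo {
  bool generated;      // stands in for m_Flag == FPDFTEXT_CHAR_GENERATED
  uint32_t unicode;    // 0 when the font has no glyph for the character
  uint32_t char_code;  // original character code from the stream
  bool control;        // stands in for IsControlChar(charinfo)
};

// Condition before the patch: a character with no glyph (unicode == 0)
// is treated as absent from the text string.
bool AppearsInTextOld(const CharInfo& c) {
  return c.generated || (c.unicode != 0 && !c.control);
}

// Condition after the patch: a character with no glyph but a non-zero
// character code is treated as present (it is emitted as 0xFFFE).
bool AppearsInTextNew(const CharInfo& c) {
  return AppearsInTextOld(c) || (c.unicode == 0 && c.char_code != 0);
}

// Returns, for each position in the text string, the index of the
// corresponding entry in the character list (assumed flat layout).
std::vector<int> MapTextToCharList(const std::vector<CharInfo>& chars,
                                   bool (*appears_in_text)(const CharInfo&)) {
  std::vector<int> text_to_char;
  for (int i = 0; i < static_cast<int>(chars.size()); ++i) {
    if (appears_in_text(chars[i]))
      text_to_char.push_back(i);
  }
  return text_to_char;
}
```

For a character list of 'A', a character with no glyph, and 'B', the text string has three entries ('A', 0xFFFE, 'B'). Under the old condition the mapping has only two entries, so text position 1 resolves to the third character and every later position is shifted, which matches the offset find highlights described in the commit message. Under the new condition each text position maps to its own character.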