Finding and Highlighting Text in a PDF with Python

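PDF (Portable Document Format) files are widely used to share and preserve documents with their original formatting intact. In a lengthy PDF, locating a specific piece of information can be time-consuming. This is where a find-and-highlight feature becomes valuable: it lets you quickly locate relevant information, pull out important details, and create visual markers for later reference.

To find and highlight text in a PDF file with Python, we will use Spire.PDF for Python, a feature-rich and user-friendly library for creating, reading, editing, and converting PDF files in Python applications.

You can install Spire.PDF for Python from PyPI with the following pip command: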

pip install Spire.Pdf

If Spire.PDF for Python is already installed and you want to upgrade to the latest version, use the following pip command:

pip install --upgrade Spire.Pdf

For more details on installation, see the official documentation: How to Install Spire.PDF for Python in VS Code.
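As a quick sanity check after installing, the snippet below asks Python's standard importlib.metadata module which version of the package is present. It is a minimal sketch: the helper name installed_version is hypothetical, and it assumes the distribution is registered under the same name used in the pip command above (Spire.Pdf).

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str = "Spire.Pdf") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        # The distribution is not installed in the current environment
        return None


if __name__ == "__main__":
    found = installed_version()
    print(f"Spire.Pdf version: {found}" if found else "Spire.Pdf is not installed.")
```

If the helper returns None, running the install command above in the same environment is the likely fix.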

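The PdfTextFinder class in Spire.PDF for Python is used to search for text in a PDF document. With its Find method you can locate a specific word or sentence on a PDF page. You can then highlight each instance of the found text in a bright color and obtain the number of instances and the corresponding page numbers.

The steps to find and highlight text in a PDF document with Python are as follows:

1. Create an instance of the PdfDocument class and load the PDF document with the PdfDocument.LoadFromFile method.
2. Initialize a counter to track the number of text instances, and a list to store the page numbers on which the text appears.
3. Iterate through the pages of the PDF. For each page, create a PdfTextFinder instance and set the text-finding parameter (such as WholeWord or IgnoreCase) through the PdfTextFinder.Options.Parameter property.
4. Search the page for the specific text with the PdfTextFinder.Find method. It returns a list of PdfTextFragment objects, each representing one instance of the text.
5. Iterate through the PdfTextFragment objects in the list. Highlight each instance with the PdfTextFragment.HighLight method, increment the instance count, and add the current page number to the list.
6. Save the resulting document to a new file with the PdfDocument.SaveToFile method.
7. Print the number of instances and the page numbers.

The following code example shows how to find and highlight text in a PDF with Python: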

from spire.pdf.common import *
from spire.pdf import *

# Create an object of the PdfDocument class
doc = PdfDocument()
# Load a PDF file
doc.LoadFromFile("Adobe Acrobat.pdf")

# Initialize a counter to keep track of the number of instances
instance_count = 0
# Initialize a list to store the page numbers
page_numbers = []

# Iterate through the pages in the document
for i in range(doc.Pages.Count):
    page = doc.Pages[i]
    # Create a PdfTextFinder instance
    finder = PdfTextFinder(page)
    # Set the text finding parameter
    finder.Options.Parameter = TextFindParameter.WholeWord
    # Find a specific text
    results = finder.Find("Adobe Acrobat")
    # Highlight all instances of the specific text
    for text in results:
        text.HighLight(Color.get_Yellow())
        # Increment the instance count
        instance_count += 1
        # Add the page number (1-based) to the list
        page_numbers.append(i + 1)

# Save the result file
doc.SaveToFile("FindAndHighlightText.pdf")

# Print the number of instances and the page numbers
print(f"The text 'Adobe Acrobat' appears {instance_count} times in the PDF.")
print(f"The text appears on the following pages: {', '.join(map(str, page_numbers))}")
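Because the example appends a page number for every instance, a page that contains several matches is listed several times. If one entry per page is preferred, the list can be condensed before printing. The sketch below uses only the standard library; the helper name summarize_pages and the sample data are hypothetical and not part of Spire.PDF.

```python
from collections import Counter
from typing import Dict, List


def summarize_pages(page_numbers: List[int]) -> Dict[int, int]:
    """Map each page number to the number of instances found on it, in page order."""
    counts = Counter(page_numbers)
    return {page: counts[page] for page in sorted(counts)}


# Hypothetical result of a search loop: two hits on page 1, one on page 3
page_numbers = [1, 1, 3]
summary = summarize_pages(page_numbers)

print(f"Pages with matches: {', '.join(map(str, summary))}")
for page, hits in summary.items():
    print(f"Page {page}: {hits} instance(s)")
```

The total instance count is unchanged by this step; it is simply the sum of the per-page values.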

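In some cases you may need to find and highlight text within a specific area of a PDF page rather than across the whole page. The PdfTextFinder.Options.Area property lets you define the page area in which to search.

The steps to find and highlight text in a specific page area with Python are as follows:

1. Create an instance of the PdfDocument class and load the PDF document with the PdfDocument.LoadFromFile method.
2. Iterate through the pages of the PDF. For each page, create a PdfTextFinder instance and set the area to search through the PdfTextFinder.Options.Area property.
3. Search the page area for the specific text with the PdfTextFinder.Find method.
4. Highlight each found instance with the PdfTextFragment.HighLight method.
5. Save the resulting document with the PdfDocument.SaveToFile method.

The following code example shows how to find and highlight text in a specific PDF page area with Python: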

from spire.pdf.common import *
from spire.pdf import *

# Create an object of the PdfDocument class
doc = PdfDocument()
# Load a PDF file
doc.LoadFromFile("Adobe Acrobat.pdf")

# Iterate through the pages in the document
for i in range(doc.Pages.Count):
    page = doc.Pages[i]
    # Create a PdfTextFinder instance
    finder = PdfTextFinder(page)
    # Set the page area to search for text
    finder.Options.Area = RectangleF(0.0, 0.0, 300.0, 300.0)
    # Find a specific text
    results = finder.Find("Adobe Acrobat")
    # Highlight all instances of the specific text
    for text in results:
        text.HighLight(Color.get_Yellow())

# Save the resulting file
doc.SaveToFile("FindAndHighlightTextInPageArea.pdf")
doc.Close()
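When the search area is known in physical units, it may help to convert it before building the rectangle. The sketch below assumes, as is common for PDF libraries, that the four RectangleF arguments are x, y, width, and height expressed in points (1 point = 1/72 inch); this assumption should be checked against the Spire.PDF documentation. The helper names mm_to_points and area_from_mm are hypothetical.

```python
from typing import Tuple

POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def mm_to_points(mm: float) -> float:
    """Convert millimetres to PDF points (1 point = 1/72 inch)."""
    return mm * POINTS_PER_INCH / MM_PER_INCH


def area_from_mm(x: float, y: float, width: float, height: float) -> Tuple[float, ...]:
    """Return (x, y, width, height) in points for a rectangle given in millimetres."""
    return tuple(mm_to_points(value) for value in (x, y, width, height))


# A 100 mm x 50 mm region anchored at the page origin
area = area_from_mm(0, 0, 100, 50)
print(tuple(round(value, 2) for value in area))
```

Under the stated assumption, the resulting tuple could be unpacked into the rectangle used for the search area, for example RectangleF(*area).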

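Regular expressions (regex) are a powerful tool for complex text searches, letting you match and extract information precisely according to patterns and rules.

To search for and highlight text in a PDF with a regular expression, first set the PdfTextFinder.Options.Parameter property to TextFindParameter.Regex to enable regex-based searching. Then pass the regular expression as the argument to the Find method.

The steps to find and highlight text in a PDF using a regular expression with Python are as follows:

1. Create an instance of the PdfDocument class and load the PDF document with the PdfDocument.LoadFromFile method.
2. Iterate through the pages of the PDF. For each page, create a PdfTextFinder instance and set the PdfTextFinder.Options.Parameter property to TextFindParameter.Regex to enable regex-based searching.
3. Pass the regular expression to the PdfTextFinder.Find method to run the search.
4. Highlight each matched instance with the PdfTextFragment.HighLight method.
5. Save the resulting document with the PdfDocument.SaveToFile method.

The following code example shows how to find and highlight text in a PDF using a regular expression with Python: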

from spire.pdf.common import *
from spire.pdf import *

# Create an object of the PdfDocument class
doc = PdfDocument()
# Load a PDF file
doc.LoadFromFile("Template.pdf")

# Iterate through the pages in the document
for i in range(doc.Pages.Count):
    page = doc.Pages[i]
    # Create a PdfTextFinder instance
    finder = PdfTextFinder(page)
    # Set the text finding parameter to enable regex-based searching
    finder.Options.Parameter = TextFindParameter.Regex
    # Find the text starting with the symbol "#"
    results = finder.Find("""\\#\\w+\\b""")
    # Highlight all matched text
    for text in results:
        text.HighLight(Color.get_Yellow())

# Save the resulting document
doc.SaveToFile("FindAndHighlightTextUsingRegex.pdf")
doc.Close()
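Before running a pattern against a whole document, it can be useful to try it on a plain string first. The sketch below does this with Python's built-in re module. Note the assumption: the regex engine used inside Spire.PDF may not be Python's re, so results can differ for more advanced syntax, although a simple pattern like this one is widely portable. The helper name find_hashtags and the sample sentence are hypothetical.

```python
import re

# The escaped string passed to Find in the example is this pattern:
# a "#" followed by one or more word characters, ending at a word boundary
assert """\\#\\w+\\b""" == r"\#\w+\b"

HASHTAG_PATTERN = re.compile(r"\#\w+\b")


def find_hashtags(text: str) -> list:
    """Return every substring of text that matches the hashtag pattern."""
    return HASHTAG_PATTERN.findall(text)


sample = "Release notes: see #python and #PDF_tools, but not a lone # sign."
print(find_hashtags(sample))  # ['#python', '#PDF_tools']
```

The lone "#" is skipped because the pattern requires at least one word character directly after the symbol.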

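You can find specific text in a PDF and retrieve the coordinates of each found instance through the X and Y properties of the entries in PdfTextFragment.Positions.

The steps to find text in a PDF and retrieve the coordinates of each found instance with Python are as follows:

1. Create an instance of the PdfDocument class and load the PDF document with the PdfDocument.LoadFromFile method.
2. Iterate through the pages of the PDF. For each page, create a PdfTextFinder instance.
3. Search for the specific text with the PdfTextFinder.Find method.
4. Get the X and Y coordinates of each found instance from PdfTextFragment.Positions[0].X and PdfTextFragment.Positions[0].Y.

The following code example shows how to find text in a PDF and retrieve the coordinates of each found instance with Python: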

from spire.pdf.common import *
from spire.pdf import *

# Create an object of the PdfDocument class
doc = PdfDocument()
# Load a PDF file
doc.LoadFromFile("Adobe Acrobat.pdf")

# Iterate through the pages in the document
for i in range(doc.Pages.Count):
    page = doc.Pages[i]
    # Create a PdfTextFinder instance
    finder = PdfTextFinder(page)
    # Find a specific text
    results = finder.Find("Adobe Acrobat")
    # Print the coordinates of each found instance
    for text in results:
        print(f"Text Position: ({text.Positions[0].X}, {text.Positions[0].Y})")

doc.Close()
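Instead of only printing them, the coordinates can be collected and ordered for later processing. The sketch below sorts hits into reading order using plain Python data. It assumes the coordinates use a top-left origin, so a smaller Y means higher on the page; if the library reports a bottom-left origin, the Y key would need to be negated. The Hit type, the reading_order helper, and the sample values are all hypothetical.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """One found instance: 1-based page number plus X and Y coordinates."""
    page: int
    x: float
    y: float


def reading_order(hits: List[Hit]) -> List[Hit]:
    """Sort hits by page, then top to bottom, then left to right."""
    return sorted(hits, key=lambda hit: (hit.page, hit.y, hit.x))


# Hypothetical coordinates, as they might be gathered from a search loop
hits = [Hit(2, 72.0, 100.0), Hit(1, 300.0, 540.5), Hit(1, 72.0, 540.5), Hit(1, 90.0, 80.0)]
for hit in reading_order(hits):
    print(f"Page {hit.page}: ({hit.x}, {hit.y})")
```

In a real run, each Hit would be built inside the inner loop from the page index and the first entry of the fragment's Positions.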

Source: 自由坦荡的湖泊AI
