Source Identification for Printed Documents (Translated Foreign-Language Material)

2022-11-18 19:47:23

2017 IEEE 3rd International Conference on Collaboration and Internet Computing

Source Identification for Printed Documents^*

Min-Jen Tsai¹, Imam Yuadi¹,², Jin-Sheng Yin¹ and Yu-Han Tao¹

¹National Chiao Tung University, Institute of Information Management, 1001 Ta-Hsueh Road, Hsin-Chu, 300, Taiwan, R.O.C.

²Airlangga University, Department of Information and Library Science, Jl. Airlangga 4-6 Surabaya 60286, East Java, Indonesia

mjtsai@cc.nctu.edu.tw

Abstract— Technological advances in digitization, together with a variety of image manipulation techniques, make it possible to create printed documents illegally. Accordingly, many researchers have studied how to determine whether a printed document is counterfeit or original. This study examines several statistical feature sets derived from the Gray Level Co-occurrence Matrix (GLCM), Discrete Wavelet Transform (DWT), spatial filters, Wiener filter, Gabor filter, Haralick and fractal filters to identify the source of text and image documents by using a support vector machine (SVM) and decision fusion of feature selection. On average, the experimental results show a higher identification rate for image documents than for text documents. In summary, the proposed method outperforms previous research and is a promising technique that can be applied to real-world forensics for printed documents.

Keywords— Forensics, GLCM, Discrete Wavelet Transform (DWT), Support Vector Machines (SVM), Spatial Filter, Wiener Filter, Gabor Filter, Haralick Filter, Fractal Filter

I. INTRODUCTION

Digital forensics is concerned with legal matters in which digital evidence is identified, collected, analyzed and examined to prove the occurrence of a crime [1]. Appropriate techniques and materials are needed to identify forensic objects accurately and precisely. Even though digital content is commonly accepted in daily life, printed documents are still widely circulated and accepted. Forensic investigation therefore still faces the challenge of providing appropriate and sufficient security measures and tools [2] to support the forensic process.

Document authentication is the process of confirming the correctness of a document by using a proven technique. For printed documents produced by a printer, further development is still needed to find the best way of providing applicable and sufficient security measures and tools for an investigation. Basically, an appropriate method is able to determine the source of a document because each printer has textural characteristics that distinguish it from others. Pattern information from a printer can be used as an intrinsic signature of the device because many small physical differences in the printer, such as motor drift and gear precision, can be seen on the printed page. Fundamentally, fluctuation in the angular velocity of a printer's photoreceptor drum is related to the gearing, and the banding frequencies directly reflect its mechanical properties [3]. Accordingly, every printer has a characteristic signature based on the corresponding fluctuation in developed toner on the printed page.

___________________________________________________

^*This work was partially supported by the National Science Council in Taiwan, Republic of China, under Grant MOST 104-2410-H-009-020-MY2 and MOST 106-2410-H-009-022.

0-7695-6303-1/17/31.00 © 2017 IEEE

DOI 10.1109/CIC.2017.00019

Several approaches for authenticating printed documents have been proposed. Mikkilineni et al. [4] applied GLCM to the English character "e" to form feature vectors. Based on examination of the printed document, two strategies were developed for printer identification. The first strategy finds intrinsic signatures that identify the characteristics of a particular printer, model and manufacturer's brand at very high resolution. The other strategy detects an extrinsic signature, created by embedding information into a document printed on an electrophotographic (EP) printer through modulation of the intrinsic feature. Tsai et al. [5] implemented GLCM and DWT based feature extraction to identify Chinese characters and used feature selection to obtain the optimum feature set for printer source identification. In a further study [6], they identified Japanese characters with more features, including GLCM, DWT, Gaussian, LoG, unsharp, Wiener and Gabor features, for classification. Additionally, Kee and Farid [7] proposed principal component analysis (PCA) and singular value decomposition (SVD) of printed characters to determine source printers. Furthermore, several studies [8, 9, 10] applied various techniques to the investigation of intrinsic marks in image documents in order to classify laser printers.

The objective of this study is to obtain the best performance for printer source identification by expanding feature extraction and applying decision fusion to feature selection. This study therefore not only covers source identification based on text documents but also proposes a universal verification system that can be applied to image documents.

The remainder of this paper is organized as follows. Section 2 describes the theoretical background of feature extraction under different statistical approaches and of classification for source laser printers. Section 3 presents the validation of the proposed method and a discussion. Section 4 concludes the paper.

II. THE APPROACH

A logical framework for printer forensics on text and image documents has been developed. The flowchart is shown in Fig. 1 and consists of the following procedures:

Digitizing documents: To start with, the text and image documents are printed and scanned, and the files are identified after the image cutting step in Fig. 1. The documents are scanned at 8 bits/pixel and 300 dpi resolution with an HP Scanjet 4050 scanner.

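Fig. 1. Procedure for identifying laser printers

Feature extraction: Nine different sets of filters are applied to obtain the richest set of values for analysis. We analyze the grayscale documents and perform feature extraction using the following statistical decompositions: GLCM, DWT, spatial filters, Gabor filter, Wiener filter, Haralick, and fractal features.

Feature selection: This study frames feature selection as a decision-theoretic model with experts, in which the feature sets are the alternatives. A count-based fusion algorithm is chosen for decision fusion. Five feature selection methods are applied before classification to obtain the most computationally efficient and effective features: plus-2-minus-1 (P2M1), plus-3-minus-2 (P3M2), plus-4-minus-3 (P4M3), sequential forward floating selection (SFFS) and sequential backward floating selection (SBFS).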
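The text names a count-based fusion algorithm but does not spell it out. The following is a minimal sketch under one plausible reading: each selection method casts one vote for every feature it picks, and the features with the most votes are kept. The function name, the tie-breaking rule and the toy selections are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def count_based_fusion(selections, k):
    """Keep the k features picked by the most selectors (ties broken by index)."""
    votes = Counter()
    for chosen in selections:
        votes.update(set(chosen))  # each selector votes at most once per feature
    ranked = sorted(votes, key=lambda f: (-votes[f], f))
    return ranked[:k]


# Hypothetical outputs of five selectors over feature indices 0..7.
selections = [
    [0, 1, 2, 5],  # e.g. P2M1
    [0, 2, 3, 5],  # e.g. P3M2
    [0, 2, 5, 7],  # e.g. P4M3
    [1, 2, 5, 6],  # e.g. SFFS
    [0, 2, 4, 5],  # e.g. SBFS
]
fused = count_based_fusion(selections, k=3)
print(fused)
```

In this toy case features 2 and 5 are chosen by all five selectors and feature 0 by four, so those three survive the fusion.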

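1. Spatial and Fractal Features

In the spatial domain, GLCM is a filter used in image analysis that, combined with texture feature computation, estimates the second-order probability density function of the pixels in the image [4]. Consistent with previous studies [5, 6], the DWT subbands are used to characterize the printer source after decomposition. Laplacian of Gaussian (LoG), unsharp, Wiener, Gabor and Haralick filters are the other feature extraction methods used in this study [11, 12, 13]. The segmentation-based fractal texture analysis (SFTA) of Costa [14] is also implemented in the image analysis for textures with similar content.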
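To make the GLCM idea concrete, here is a small pure-Python sketch that counts co-occurring gray levels for one pixel offset and derives three common texture statistics. The paper uses 22 GLCM features that are not listed in this excerpt; the three shown here (contrast, energy, homogeneity), the offset and the 4-level toy patch are assumptions chosen only for illustration.

```python
def glcm(image, levels, dx=1, dy=0):
    """Normalized co-occurrence matrix for pixel pairs offset by (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(map(sum, counts))
    return [[n / total for n in row] for row in counts]


def glcm_features(p):
    """Contrast, energy and homogeneity of a normalized GLCM."""
    idx = range(len(p))
    contrast = sum(p[i][j] * (i - j) ** 2 for i in idx for j in idx)
    energy = sum(p[i][j] ** 2 for i in idx for j in idx)
    homogeneity = sum(p[i][j] / (1 + abs(i - j)) for i in idx for j in idx)
    return {"contrast": contrast, "energy": energy, "homogeneity": homogeneity}


# Toy 4x4 patch quantized to 4 gray levels (a real patch would be 8-bit).
patch = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
matrix = glcm(patch, levels=4)
features = glcm_features(matrix)
print(features)
```

The matrix here is built from horizontal neighbor pairs only; in practice several offsets and angles are usually combined before the statistics are computed.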

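2. Support Vector Machines (SVM)

The SVM concept is applied to classify printed documents according to printer brand and type. Put simply, an SVM attempts to find the best hyperplane, which acts as a separator between classes in the input space. The optimal separating hyperplane is found by measuring the margin of the hyperplane and locating its maximum [15].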
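The margin maximization described above is usually written as the following optimization problem. This is the textbook hard-margin form for two classes, given here as a sketch; the symbols (feature vectors x_i, labels y_i in {-1, +1}, weights w, bias b) are standard notation and do not appear in this excerpt, and the excerpt does not say which kernel or multi-class scheme was used for the 12 printers.

```latex
\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1, \qquad i = 1,\dots,n
```

Under these constraints the distance between the two supporting hyperplanes is 2/||w||, so minimizing ||w|| is the same as maximizing the margin.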

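TABLE I. MICROSCOPIC IMAGES FOR TEXT DOCUMENT

Printer Brand | Color Printing | Black Printing
LaserJet Pro 300 Color | (image not reproduced) | (image not reproduced)
OKI C5950 | (image not reproduced) | (image not reproduced)

TABLE II. MICROSCOPIC IMAGES FOR LENA

Lena image | Color Printing | Black Printing
HP LJP 300 Color | (image not reproduced) | (image not reproduced)
OKI C5950 | (image not reproduced) | (image not reproduced)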

Fig. 2. Microscopic images of Lena under the black printer setting. Printer brands and models: (a) Avision AM/MF 3000; (b) HP LaserJet Pro 200 Color M251nw; (c) LaserJet Pro 500 MFP M570dn; (d) HP Color LaserJet CP3525; (e) HP LaserJet Pro CP1025; (f) HP LaserJet 4300

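III. EXPERIMENTAL RESULTS

Extensive experiments were conducted to verify the effectiveness of the proposed source identification method for text and image documents. The experiments were carried out in four steps:

1. Pre-test

Before the experiments, a preliminary procedure, which we call the pre-test, must be carried out. Its purpose is to determine and verify whether the properties of the paper affect the accuracy of printer source identification. It also addresses the printer list, which contains both color and black printers.

In the pre-test, a USB microscope is used to obtain more detailed graphical information from the text and image documents. The pre-test also analyzes the quality of different papers and examines the choice of paper material. We studied white papers of different types, colors and textures. As shown in Table I, there is in general no significant difference in the character "e" between color printer and black printer output. In contrast, as shown in Table II, there is a clear difference for the Lena image. However, when the image is printed with the black printer setting (as shown in Fig. 2), it is difficult to distinguish between color and black printers.

2. Data Samples

To verify and compare the approach, the samples include not only text documents in different language scripts but also image documents with different content. The text and image regions to be examined are located in the scanned images. The brands and models of the 12 printers used in this study are listed in Table IV. Digital documents are produced in BMP format after scanning. The text documents are shown in Table III and image patch samples are shown in Fig. 3.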

Fig. 3. Image patch samples for Wikipedia from 12 different printers

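3. Feature Selection

To select the λ most important features without loss of accuracy, an adaptive feature selection algorithm is implemented. The number of selected features is determined from the accuracy rates obtained with the 247 features. The full feature set comes from GLCM (22 features), DWT (12 features), Gaussian filter (21 features), LoG filter (21 features), unsharp filter (21 features), Wiener filter (64 features), Gabor filter (48 features), Haralick filter (14 features) and fractal filter (24 features). Feature selection is therefore carried out in the next step to reduce the computational requirements.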
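As a quick arithmetic check, the per-filter counts listed above can be summed to confirm the 247-feature total. The dictionary below is only a restatement of those counts; the variable names are illustrative.

```python
# Per-filter feature counts as listed in the text.
FEATURE_COUNTS = {
    "GLCM": 22,
    "DWT": 12,
    "Gaussian": 21,
    "LoG": 21,
    "Unsharp": 21,
    "Wiener": 64,
    "Gabor": 48,
    "Haralick": 14,
    "Fractal": 24,
}

total = sum(FEATURE_COUNTS.values())
glcm_dwt = FEATURE_COUNTS["GLCM"] + FEATURE_COUNTS["DWT"]
without_haralick_fractal = (
    total - FEATURE_COUNTS["Haralick"] - FEATURE_COUNTS["Fractal"]
)
print(total, glcm_dwt, without_haralick_fractal)
```

The two partial sums, 34 and 209, are consistent with the feature-set sizes quoted for the earlier Chinese-character and Japanese-character studies, which used GLCM and DWT only and the first seven filter groups respectively.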

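The feature selection process uses five feature selection methods: P2M1, P3M2, P4M3, SFFS and SBFS. Ten sets of images from the 12 printer sources are generated at random. In each set, 500 images from each printer are selected as training data and another 300 images as test data. Given the number of features to be selected and the running time involved, we group the features into nine main categories: GLCM, DWT, Gaussian filter, LoG filter, unsharp filter, Wiener filter, Gabor filter, Haralick and fractal features. The order in which features are selected during execution is used to choose the most important ones. As a result, the accuracy rate is plotted against the number of features after feature selection. From the convexity of the plot, the λ most important features can be decided according to the highest rate in the plot. Owing to limited space, we report only λ = 165 as the most effective number of features for further experimental analysis.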
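The plus-l-minus-r family (P2M1, P3M2, P4M3) can be sketched as a greedy loop that adds the l best features and then removes the r least useful ones. The code below is a minimal illustration, not the authors' implementation: the `score` callback stands in for whatever criterion is used (presumably classification accuracy), and the toy weights and redundancy penalty are invented for the example.

```python
def plus_l_minus_r(n_features, score, target, l=2, r=1):
    """Greedy plus-l-minus-r search; assumes l - r == 1 so target is hit exactly."""
    selected = []
    remaining = list(range(n_features))
    while len(selected) < target:
        for _ in range(l):  # forward steps: add the feature that helps most
            if not remaining:
                break
            best = max(remaining, key=lambda f: score(selected + [f]))
            selected.append(best)
            remaining.remove(best)
        for _ in range(r):  # backward steps: drop the feature missed least
            if len(selected) <= 1:
                break
            worst = max(
                selected, key=lambda f: score([g for g in selected if g != f])
            )
            selected.remove(worst)
            remaining.append(worst)
    return selected


# Hypothetical criterion: individual usefulness minus a redundancy penalty.
WEIGHTS = [0.9, 0.8, 0.1, 0.5]


def toy_score(subset):
    value = sum(WEIGHTS[f] for f in subset)
    if 0 in subset and 1 in subset:
        value -= 0.6  # features 0 and 1 carry overlapping information
    return value


chosen = plus_l_minus_r(len(WEIGHTS), toy_score, target=2)
print(sorted(chosen))
```

Here the search prefers features 0 and 3 over the individually stronger pair 0 and 1, because the backward step discards a feature once it turns out to be redundant.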

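4. Printer Source Identification

The accuracy rate of printer forensics essentially distinguishes the performance of different techniques for printer source identification. After the first stage, the experiments in this stage are performed with different test data to compare performance using the selected feature set. As shown in Table V, the identification accuracy for image documents is higher than for text documents. With text data, the average result is 96.40% using all 247 features and 97.68% using 165 features. For image documents, on the other hand, the average result is 99.60% using all features and 99.67% with feature selection. This means that image documents yield better identification accuracy than text documents in determining printer brand and type.

Identification with feature selection outperforms identification without it. To analyze the identification results of the different approaches systematically, simulations with nine different feature sets were run at the higher resolution of 600 dpi for the characters "e", "Nu;", "҉" and "嗘", and the results are shown in Table VI. Clearly, the proposed method achieves the highest accuracy in the simulations. For example, the accuracy for the character "Nu;" reaches 99.82% using 165 features, which is better than the 99.48% obtained using 247 features. For the characters "e", "҉" and "嗘", the selected filters reach an accuracy above 98%. The proposed method, using the 165 features obtained after feature selection, therefore performs better than the other feature sets.

With the proposed method for printer source model identification, image documents outperform text documents. Feature selection with decision fusion also reduces the number of features more effectively and improves the identification results. The results indicate that the proposed technique is a promising one for real-world forensics of printed documents.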

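5. Discussion of Related Issues

The advantages of the approach described in this paper are as follows. (1) Investigating the print source using document images achieves better classification, and more features yield higher accuracy. However, as more features are used, the computational complexity and processing time also increase. Feature selection is an effective way to reduce the number of features and obtain better ones. (2) Toner particles are melted by the heat of the fuser and thereby bonded to the paper. Electromechanical imperfections in the printing process affect print quality through toner variation at the development stage. This intrinsic texture signature that the printer leaves on the paper document, including shape, size and pattern, serves as a guide for researchers to distinguish and classify printer sources. (3) Earlier work studied 34 features for Chinese characters [5] and 209 features for Japanese characters [6]. In this study, we further extend the feature space from 209 to 247 features, which provides better experimental results. Even though the results are promising, finding universal features remains an open problem.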

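IV. CONCLUSION

This paper presents text and image document analysis techniques for printer classification. SVM-based classification with feature selection and decision fusion is employed. The λ most important features are systematically selected from GLCM, DWT, spatial filters, Wiener filter, Gabor filter, Haralick and fractal features. Image documents achieve higher identification accuracy than text documents in determining printer brand and type. The results demonstrate the good performance and effectiveness of the proposed method.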

REFERENCES

[1] J.A. Lewis, Forensic document examination: fundamentals and current trends, Oxford, Elsevier, 2014.
[2] A.J. Marcella, F. Guillossou, Cyber Forensics: From Data t