Reading the "clean" text from RTF in PHP

Post by: hanhga   Category: Php&mySql

Introduction

Rich Text Format (often abbreviated as RTF) is, to the surprise of many, quite a complex text format. Over its long history RTF has acquired many extensions that get in the way of extracting "clean" text. Let's try to solve that.

A little theory

First, let's look inside an RTF file.

An RTF file consists of unformatted text, control words and groups.
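
As a small illustration of these three building blocks, here is a sketch that keeps a hypothetical minimal RTF document in a string and counts its groups and control words. The sample document and the counting approach are assumptions made for this example only; real files are usually far larger.

```php
<?php
// A hypothetical minimal RTF document. A nowdoc keeps the backslashes literal.
$rtf = <<<'RTF'
{\rtf1\ansi\ansicpg1252\deff0
{\fonttbl{\f0 Arial;}}
\f0\fs24 Hello, {\b world}!\par
}
RTF;

// Every opening brace starts one group.
$groups = substr_count($rtf, '{');

// A control word is a backslash, letters, and an optional numeric parameter.
preg_match_all('/\\\\[a-z]+-?\d*/', $rtf, $m);

echo $groups, " groups, ", count($m[0]), " control words\n";
// prints "4 groups, 10 control words"
```

Everything else in the sample ("Arial;", "Hello, ", "world", "!") is unformatted text, although only part of it is text a reader would see on screen.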

A control word is a specially formatted command that RTF uses to mark printer control codes and information that applications use to manage documents. A control word is made up of lowercase letters from "a" to "z". Each control word begins with a backslash (\) and is terminated by one of the following delimiters:

  • A space. In this case, the space is part of the control word.
  • A digit or a hyphen (-), which means that a numeric parameter follows. The parameter can be a positive or a negative number.
  • Any character other than a letter or a digit. In this case, the character is not part of the control word.

Therefore, the character string \rtf1\ansi\ansicpg1251 can easily be divided into three control words: rtf with parameter 1 (the major format version), ansi (the current character set) and ansicpg with parameter 1251 (the current code page, Windows-1251).
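
The splitting rule can be sketched with a regular expression. The helper name rtf_split_control_words is hypothetical and is not part of the reader shown later; it only illustrates how a run of control words breaks into (word, parameter) pairs.

```php
<?php
// Hypothetical helper: split a run of RTF control words into [word, parameter] pairs.
// A parameter that is absent is reported as null.
function rtf_split_control_words(string $s): array {
    // Backslash, then letters, then an optional signed integer parameter.
    preg_match_all('/\\\\([a-zA-Z]+)(-?\d+)?/', $s, $matches, PREG_SET_ORDER);
    $result = [];
    foreach ($matches as $match) {
        $hasParam = isset($match[2]) && $match[2] !== '';
        $result[] = [$match[1], $hasParam ? (int) $match[2] : null];
    }
    return $result;
}

foreach (rtf_split_control_words('\rtf1\ansi\ansicpg1251') as [$word, $param]) {
    echo $word, ' => ', $param === null ? '(no parameter)' : $param, "\n";
}
// prints:
// rtf => 1
// ansi => (no parameter)
// ansicpg => 1251
```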

A group consists of text and control words enclosed in braces ({}). Control words defined within a group affect only the text inside that group and all nested subgroups. To know which control words are currently active, we will use a stack of control words. On reading an opening brace ({), we push a new element onto the stack and copy the data from the previous element into it. On reading a closing brace (}), we remove the top element.

We should also mention that some control words can be turned off not only by closing the group but also by adding the parameter 0 to the control word. For example, the strings This is {\b bold} text and This is \b bold \b0 text give the same result: "This is bold text", with only the word "bold" set in bold.
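
To make the stack idea and the parameter-0 rule concrete, here is a minimal sketch that reports, for each run of plain text, whether bold is active. The function rtf_trace_bold is a hypothetical name used for illustration; it assumes balanced braces and ignores escapes such as \\ or \'hh, which the full reader has to handle.

```php
<?php
// Hypothetical helper: split an RTF fragment into [text, isBold] chunks.
function rtf_trace_bold(string $rtf): array {
    $stack = [[]];   // one state array per open group; the first is the top level
    $chunks = [];
    for ($i = 0, $len = strlen($rtf); $i < $len; $i++) {
        $c = $rtf[$i];
        if ($c === '{') {
            // New group: start from a copy of the current state.
            $stack[] = end($stack);
        } elseif ($c === '}') {
            // Group closed: the parent's state becomes active again.
            array_pop($stack);
        } elseif ($c === '\\' && preg_match('/\G\\\\([a-z]+)(-?\d+)? ?/', $rtf, $m, 0, $i)) {
            // Control word: parameter 0 switches it off, anything else switches it on.
            $stack[count($stack) - 1][$m[1]] = !(isset($m[2]) && $m[2] === '0');
            $i += strlen($m[0]) - 1;   // skip the word, its parameter and the delimiting space
        } else {
            // Plain text: append to the last chunk if the bold flag is unchanged.
            $bold = !empty(end($stack)['b']);
            $last = count($chunks) - 1;
            if ($last >= 0 && $chunks[$last][1] === $bold) {
                $chunks[$last][0] .= $c;
            } else {
                $chunks[] = [$c, $bold];
            }
        }
    }
    return $chunks;
}

foreach (['This is {\b bold} text', 'This is \b bold \b0 text'] as $sample) {
    echo implode('', array_column(rtf_trace_bold($sample), 0)), "\n";
}
// prints "This is bold text" twice
```

Copying the parent's state on every opening brace costs a little memory, but it means a closing brace never has to undo individual control words: popping the stack is enough.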

Now we can come to the conclusion that all characters in an RTF file that are not control words or braces are plain text.

How to get "clean" text?

Even though we already know how to distinguish plain text from control words, we still need to discuss character encoding.

RTF is an 8-bit format. That would limit it to ASCII, but RTF can encode characters beyond ASCII by escape sequences. The character escapes are of two types: code page escapes and Unicode escapes.

In a code page escape, two hexadecimal digits following a backslash and an apostrophe (\'hh) denote a character taken from a Windows code page. The current code page is specified by the control word \ansicpg. For example, if \ansicpg1256 is present, the sequence \'c8 will encode the Arabic letter beh (ب).
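
A code page escape can be decoded with PHP's iconv extension, as in the sketch below. The helper name rtf_decode_codepage_escape is hypothetical, and the sketch assumes that the local iconv implementation knows the code page under the name "CP" followed by its number.

```php
<?php
// Hypothetical helper: decode the two hex digits of a \'hh escape to UTF-8.
// $codePage is the number given by \ansicpg, for example 1256.
function rtf_decode_codepage_escape(string $hex, int $codePage): string {
    // The two hex digits describe a single byte in the given code page.
    $byte = chr(hexdec($hex));
    // Assumption: iconv accepts names such as "CP1256" on this system.
    $utf8 = @iconv('CP' . $codePage, 'UTF-8', $byte);
    // Return an empty string if the byte cannot be converted.
    return $utf8 === false ? '' : $utf8;
}

echo rtf_decode_codepage_escape('c8', 1256), "\n";   // Arabic letter beh
echo rtf_decode_codepage_escape('c0', 1251), "\n";   // Cyrillic capital letter A
```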

If a Unicode escape is required, the control word \u is used, followed by a 16-bit signed decimal integer giving the Unicode code point number. Because the number is signed, code points above 32767 are written as negative values. For the benefit of programs without Unicode support, this must be followed by the nearest representation of this character in the specified code page. For example, \u1576? would give the Arabic letter beh, specifying that older programs which do not have Unicode support should render it as a question mark instead.

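The control word \uc0 can be used to indicate that subsequent Unicode escape sequences within the current group do not specify a substitution character. More generally, \ucN gives the number of substitution characters that follow each Unicode escape; the default is 1.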
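
The two rules about Unicode escapes can be sketched as follows. The function rtf_decode_unicode_escapes is a hypothetical helper, and passing the \uc count as an argument is a simplification for this example; in a real reader the count comes from the state of the current group.

```php
<?php
// Hypothetical helper: decode \uN escapes in an RTF fragment.
// $uc is the number of substitution characters after each escape (the \ucN value).
function rtf_decode_unicode_escapes(string $rtf, int $uc = 1): string {
    // \uN, an optional delimiting space, then $uc substitution characters.
    // A \'hh escape counts as a single substitution character.
    $pattern = '/\\\\u(-?\d+) ?(?:\\\\\'[0-9a-fA-F]{2}|.){' . $uc . '}/s';
    return preg_replace_callback($pattern, function (array $m): string {
        $code = (int) $m[1];
        // The value is a signed 16-bit number: negative values wrap around.
        if ($code < 0) {
            $code += 65536;
        }
        return html_entity_decode('&#' . $code . ';', ENT_QUOTES, 'UTF-8');
    }, $rtf);
}

echo rtf_decode_unicode_escapes('\u1576?'), "\n";      // beh, the "?" is skipped
echo rtf_decode_unicode_escapes('\u1576?', 0), "\n";   // beh followed by "?"
echo rtf_decode_unicode_escapes('\u-1279?'), "\n";     // U+FB01, written as a negative number
```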

Let’s read!

Now we have enough theoretical knowledge to start reading our first .rtf files:

<?php

// Function that checks whether the data are the on-screen text.
// It works in the following way:
// an array arrfailAt stores the control words for the current state of the stack, which show that
// input data are something else than plain text.
// For example, there may be a description of font or color palette etc. 
function rtf_isPlainText($s) {
    $arrfailAt = array("*", "fonttbl", "colortbl", "datastore", "themedata");
    for ($i = 0; $i < count($arrfailAt); $i++)
        if (!empty($s[$arrfailAt[$i]])) return false;
    return true;
} 

function rtf2text($filename) {
    // Read the data from the input file.
    $text = file_get_contents($filename);
    if (!strlen($text))
        return "";

    // Create empty stack array.
    $document = "";
    $stack = array();
    $j = -1;
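    // Read the data character by character.
    // The body of this loop, down to the "u" case, was damaged in the listing;
    // it is reconstructed here from the surviving fragments.
    for ($i = 0, $len = strlen($text); $i < $len; $i++) {
        $c = $text[$i];

        // Decide what to do depending on the current character.
        switch ($c) {
            // A backslash starts a control word or a control symbol.
            case "\\":
                // Read the next character.
                $nc = $i + 1 < $len ? $text[$i + 1] : "";

                // An escaped backslash or brace is plain text; \~ is a
                // non-breaking space and \_ is a non-breaking hyphen.
                if ($nc == "\\" || $nc == "{" || $nc == "}") {
                    if (rtf_isPlainText($stack[$j])) $document .= $nc;
                } elseif ($nc == "~") {
                    if (rtf_isPlainText($stack[$j])) $document .= " ";
                } elseif ($nc == "_") {
                    if (rtf_isPlainText($stack[$j])) $document .= "-";
                // \* marks a destination that may be skipped; remember it on the stack.
                } elseif ($nc == "*") {
                    $stack[$j]["*"] = true;
                // \'hh is a code page escape: decode the byte using the code page
                // given by \ansicpg (Windows-1252 is assumed if none was seen).
                } elseif ($nc == "'") {
                    $hex = substr($text, $i + 2, 2);
                    if (rtf_isPlainText($stack[$j])) {
                        $cp = isset($stack[$j]["ansicpg"]) ? (int) $stack[$j]["ansicpg"] : 1252;
                        $char = @iconv("CP" . $cp, "UTF-8", chr(hexdec($hex)));
                        if ($char !== false) $document .= $char;
                    }
                    // Skip the two hexadecimal digits.
                    $i += 2;
                // A letter starts a control word, possibly followed by a numeric parameter.
                } elseif ($nc >= 'a' && $nc <= 'z' || $nc >= 'A' && $nc <= 'Z') {
                    $word = "";
                    $param = null;

                    // Start reading characters after the backslash.
                    for ($k = $i + 1, $m = 0; $k < $len; $k++, $m++) {
                        $nc = $text[$k];
                        // A letter extends the control word unless the parameter has
                        // already started; in that case the control word is over.
                        if ($nc >= 'a' && $nc <= 'z' || $nc >= 'A' && $nc <= 'Z') {
                            if ($param !== null) break;
                            $word .= $nc;
                        // A digit extends the parameter.
                        } elseif ($nc >= '0' && $nc <= '9') {
                            $param .= $nc;
                        // A minus sign is only valid as the first character of the parameter.
                        } elseif ($nc == '-' && $param === null) {
                            $param = $nc;
                        } else {
                            break;
                        }
                    }
                    // A single space after a control word is its delimiter, not text.
                    if ($k < $len && $text[$k] == ' ') $m++;
                    // Move the pointer past the characters we have read.
                    $i += $m - 1;

                    // Analyze the control word.
                    $toText = "";
                    switch (strtolower($word)) {
                        // \uN is a Unicode escape: N is a signed 16-bit decimal code point.
                        // It is followed by substitution characters for readers without
                        // Unicode support; \ucN says how many (one by default). Skip them.
                        case "u":
                            $code = (int) $param;
                            if ($code < 0) $code += 65536;
                            $toText .= html_entity_decode("&#" . $code . ";", ENT_QUOTES, "UTF-8");
                            $ucDelta = isset($stack[$j]["uc"]) ? (int) $stack[$j]["uc"] : 1;
                            // A \'hh substitute counts as one character but takes four bytes.
                            for ($n = 0; $n < $ucDelta; $n++)
                                $i += substr($text, $i + 2, 2) == "\\'" ? 4 : 1;
                        break;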
                        // Line breaks, spaces and tabs.
                        case "par": case "page": case "column": case "line": case "lbr":
                            $toText .= "\n";
                        break;
                        case "emspace": case "enspace": case "qmspace":
                            $toText .= " ";
                        break;
                        case "tab": $toText .= "\t"; break;
                        // Add current date and time instead of corresponding labels.
                        case "chdate": $toText .= date("m.d.Y"); break;
                        case "chdpl": $toText .= date("l, j F Y"); break;
                        case "chdpa": $toText .= date("D, j M Y"); break;
                        case "chtime": $toText .= date("H:i:s"); break;
                        // Replace some special characters with their HTML entities,
                        // decoded to the output encoding.
                        case "emdash": $toText .= html_entity_decode("&mdash;"); break;
                        case "endash": $toText .= html_entity_decode("&ndash;"); break;
                        case "bullet": $toText .= html_entity_decode("&bull;"); break;
                        case "lquote": $toText .= html_entity_decode("&lsquo;"); break;
                        case "rquote": $toText .= html_entity_decode("&rsquo;"); break;
                        case "ldblquote": $toText .= html_entity_decode("&laquo;"); break;
                        case "rdblquote": $toText .= html_entity_decode("&raquo;"); break;
                        // Add all other control words to the stack. If a control word
                        // has no parameter, store true; otherwise store $param itself,
                        // so that a parameter of 0 reads as "switched off".
                        default:
                            $stack[$j][strtolower($word)] = $param === null ? true : $param;
                        break;
                    }
                    // Add data to the output stream if required.
                    if (rtf_isPlainText($stack[$j]))
                        $document .= $toText;
                }

                $i++;
            break;
            // An opening brace starts a new group: push a copy of the current
            // state onto the stack (an empty state for the outermost group).
            case "{":
                $stack[] = $j >= 0 ? $stack[$j] : array();
                $j++;
            break;
            // A closing brace ends the group: remove the top stack element.
            case "}":
                array_pop($stack);
                $j--;
            break;
            // Skip "trash": NUL, carriage return, form feed and line feed.
            case "\0": case "\r": case "\f": case "\n": break;
            // Add other data to the output stream if required.
            default:
                if (rtf_isPlainText($stack[$j]))
                    $document .= $c;
            break;
        }
    }
    // Return result.
    return $document;
}
?>

 

Posted on 17-12-2015

Copyright © 2015. All rights reserved. codetitle.com Develope by Vinagon .Ltd