PHP Classes

PHP DOCX to Text: Extract text from Microsoft Word DOCX files

Version: docxtotext 1.0.2
License: GNU General Publi...
PHP version: 5
Categories: PHP 5, Files and Folders, Text proces...
Description 

Author

This class can extract text from Microsoft Word DOCX files. It will work with PHP up to version 8.1.

It takes the path of a Microsoft Word file in DOCX format and extracts the text it contains, including list and paragraph numbering, along with footnotes and endnotes and their reference numbers.

The class returns document text as an array with one element per paragraph.

Name: Timothy Edwards
Classes: 4 packages
Country: United Kingdom
Innovation award nominee: 2x

Example

<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
</head>

<body>
<?php
require_once('wordtext.php');

// Normal mode (no debug output), with the extracted text encoded as UTF-8.
$rt = new WordTEXT(false, 'UTF-8');
$text = $rt->readDocument('sample.docx');

// Element 0 holds 'Number:Length': the element count and the longest element's length.
$det = explode(':', $text[0]);
echo "No of text elements in the array - " . $det[0] . "<br>";
echo "Max length of a text element in the array - " . $det[1] . "<br>&nbsp;<br>";

// The paragraphs themselves start at element 1.
$LC = 1;
while ($LC <= $det[0]) {
    echo "Element " . $LC . " : " . $text[$LC] . "<br>";
    $LC++;
}
?>
</body>
</html>


Details

A PHP class to extract all the text from a Word DOCX document and output it as a text array

Description

This PHP class takes a Word document in DOCX format and extracts all the text from it. The text includes all list and paragraph numbering, as well as footnotes and endnotes together with their reference numbers. The text is output as an array, with one array element per paragraph, which makes it easy to search or manipulate the text or to save it to a database. For convenience, the first element [0] of the array contains the number of text array elements and the length of the longest element in the format 'Number:Length'. In normal mode the class produces no output to the screen.
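As an illustration of the 'Number:Length' header and of searching the array, here is a minimal sketch. The helpers parseHeader() and findParagraphs() are hypothetical and not part of the class, the sample array only stands in for real output, and the syntax assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helpers; not part of the WordTEXT class.

/**
 * Split the 'Number:Length' header stored in element 0.
 * Returns [count, maxLength] as integers.
 */
function parseHeader(array $text): array
{
    [$count, $maxLength] = array_map('intval', explode(':', $text[0]));
    return [$count, $maxLength];
}

/**
 * Return the paragraphs (elements 1..count) that contain $needle,
 * keyed by their element number. The match is case-insensitive.
 */
function findParagraphs(array $text, string $needle): array
{
    [$count] = parseHeader($text);
    $matches = [];
    for ($i = 1; $i <= $count; $i++) {
        if (isset($text[$i]) && stripos($text[$i], $needle) !== false) {
            $matches[$i] = $text[$i];
        }
    }
    return $matches;
}

// Stand-in for the array that readDocument() is described as returning.
$text = ['3:18', 'Introduction', '1. First list item', 'Closing remarks'];
```

With this sample data, findParagraphs($text, 'list') returns only element 2, keyed by its element number.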

A demonstration file 'textdemo.php' is included. This expects the Word docx file to be called 'sample.docx'. The demonstration file will display on screen the resultant text array, giving the number of text elements, the length of the longest one and then all the text extracted from the document along with its array element number.
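When extracted text is displayed in an HTML page, characters such as '<' and '&' in the document could be read as markup. The sketch below shows one way to escape them before output. It assumes the text was extracted as UTF-8 and PHP 7.1 or later; renderParagraphs() is a hypothetical helper, not something shipped with the package.

```php
<?php
// Hypothetical helper; not part of textdemo.php or the WordTEXT class.

/**
 * Build one HTML line per paragraph, escaping characters such as
 * '<' and '&' so that document text cannot be read as markup.
 * Assumes the text was extracted as UTF-8.
 */
function renderParagraphs(array $text): string
{
    $html = '';
    // Skip the header in element 0 but keep the original element numbers.
    foreach (array_slice($text, 1, null, true) as $number => $paragraph) {
        $html .= 'Element ' . $number . ' : '
            . htmlspecialchars($paragraph, ENT_QUOTES, 'UTF-8') . "<br>\n";
    }
    return $html;
}

// Stand-in for the array returned by readDocument().
$text = ['2:9', 'a < b', 'Q&A notes'];
echo renderParagraphs($text);
```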

USAGE

Include the class in your PHP script

require_once('wordtext.php');

Normal mode to save all the text to an array (no output to screen)

$rt = new WordTEXT(false); or $rt = new WordTEXT();

Debug mode to display on screen the associated DOCX XML files and the text extracted from the document

$rt = new WordTEXT(true);

Set output encoding (Default is ISO-8859-1)

This alters the encoding of the resultant text, e.g. 'UTF-8', 'windows-1252', etc.

$rt = new WordTEXT(false, 'desired encoding');

Read docx file and output all the text as an array

$text = $rt->readDocument('FILENAME');
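Once the array has been read, the paragraphs can be joined into a single plain-text string, for example to save them to a file. The following is a sketch only. It assumes that the array holds nothing but the header in element 0 followed by the paragraphs, and it uses PHP 7 or later syntax. paragraphsToString() and the file name are hypothetical.

```php
<?php
// Hypothetical helper; not part of the WordTEXT class.

/**
 * Join elements 1..N of the extracted array into one string,
 * one paragraph per line, skipping the 'Number:Length' header in element 0.
 */
function paragraphsToString(array $text, string $separator = "\n"): string
{
    // array_slice drops element 0, the header.
    return implode($separator, array_slice($text, 1));
}

// Stand-in for $rt->readDocument('FILENAME').
$text = ['2:11', 'First para.', 'Second one.'];
$plain = paragraphsToString($text);

// Uncomment to write the result to a file; the file name is an arbitrary choice.
// file_put_contents('sample.txt', $plain);
```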

Update Notes

Version 1.0.2 - Fixed some bugs which prevented the script from working with some docx files. Also cleared PHP warning messages

Version 1.0.1 - Updated to now work up to at least PHP 8.1

Version 1.0.0 - Original version


Files (4)

File            Role      Description
LICENSE         Lic.      License text
README.md       Doc.      Documentation
textdemo.php    Example   Example script
wordtext.php    Class     Class source
