Extract Only Plain Text From A Page


I am trying to write some PHP that takes any web page and strips out all the HTML, CSS, and JavaScript, leaving only the plain text of the page. I have made a start, but I have hit a snag with the JavaScript and CSS. This is what I have so far:

<?php
$geturl = $_GET["url"];
ob_start();
include($geturl);
$page = ob_get_contents();
ob_end_clean();
$output = ereg_replace('<script.*.</script>', ' ', $page);
$output2 = ereg_replace('<style.*.</style>', ' ', $output);
$plaintext = strip_tags($output2);
echo $plaintext;
?>

The strip_tags function removes all HTML tags, but it leaves JavaScript and CSS behind. With an ordinary HTML element, the text between the opening and closing tags is content I want to keep, whereas with script and style elements the text between the tags is code that should go. For example:

html:
<div name="htmltag">Keep this text here</div>

javascript:
<script>function somejs() {remove all this code}</script>

As you can see, the text between the div tags should stay, but the js between the script tags should be removed because it is code.

I then tried ereg_replace to get rid of the JavaScript and CSS, but it breaks when the page contains more than one script or style block. The wildcard (.*.) is greedy: it skips over any closing script or style tag until it reaches the last one, so all the text between the two blocks is deleted as well. Example:

<SCRIPT>function somejs() {remove all this code}</script> //removes all text and code from beginning here
KEEP ALL THIS TEXT HERE
<script>function somejs() {remove all this code}</SCRIPT> //to end here

Now, finally, the question: is there any way to remove only the code between an opening tag and the very next closing tag? Or is there some other way to get rid of the JavaScript and CSS?
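
One possible approach, sketched here rather than taken from the thread: switch from ereg_replace (the ereg functions were removed in PHP 7) to preg_replace with a lazy quantifier. `.*?` stops at the nearest closing tag instead of the last one, the `i` flag handles mixed-case tags, and the `s` flag lets the dot span line breaks. The function name html_to_plain_text is made up for this example.

```php
<?php
// Hypothetical helper, not from the original post.
function html_to_plain_text(string $html): string
{
    // ".*?" is lazy, so each match ends at the nearest closing tag.
    // Flags: i = case-insensitive, s = dot also matches newlines.
    $html = preg_replace('~<script\b[^>]*>.*?</script\s*>~is', ' ', $html);
    $html = preg_replace('~<style\b[^>]*>.*?</style\s*>~is', ' ', $html);

    // Remove the remaining tags, then collapse runs of whitespace.
    $text = strip_tags($html);
    return trim(preg_replace('~\s+~', ' ', $text));
}

$page = "<SCRIPT>function a() {}</script>\n"
      . "KEEP ALL THIS TEXT HERE\n"
      . "<script>function b() {}</SCRIPT>";

echo html_to_plain_text($page), "\n";
```

To fetch the page itself, file_get_contents($url) is probably a safer choice than include($geturl): include executes any PHP code the remote server returns, while file_get_contents only reads the response as a string.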





See Related Forum Messages: Follow the Links Below to View Complete Thread
Can't Output To Text/plain
I'd like to show my MySQL query results in a plain text style. So inside my PHP file I wrote:

Plain Text Email
I'm wanting to protect all inputs for sending a plain text email, in a common
routine.

Have just found POSIX [:print:] which I thought looked useful.
I didn't want to use htmlentities(); because it's a plain text email.

Would this protect me from anyone sending spam through this?

$raw = stripslashes($raw);
$raw = preg_replace("/(content-type|bcc:|cc:|onload|onclick)/i", "DELETED",
$raw);
$raw = strip_tags($raw);
$raw = preg_replace("/[^[:print:]]/", " ", $raw);
$raw = substr($raw, 0, 500);
$raw = trim($raw);

Or, should I use:
$raw = htmlentities($raw, ENT_NOQUOTES);

The email address would obviously be different.
This would cover just the name, subject and message.
I don't need newlines etc.

Plain Text Database
i'm really a newbie to php but not OOP.

i'm designing a database to hold simple text messages to display in a
page called, "News". The client doesn't want a sql database so I
suggested a plain text database. I have it working but when I pull the
data (fopen) it all comes back as one line.

It's set up as a simple form passing 2 variables, $title and $comments.
They both write (fwrite) just fine to the .txt file but upon
retrieving them (fopen) it's all one line. Since I can't pass formatted
text to a .txt file is there a different way?

As a newbie I haven't come across a solution yet. The client wants
this soon so I'm asking here due to the timeline. Given a few more
weeks I'm sure I'd stumble across it in some text.
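
A minimal sketch of one way to do this, assuming one record per line with a tab between the two fields. The function names and the file layout are invented for the example, and flattening line breaks inside a field is a simplifying assumption. Part of the "one line" symptom may also come from browsers collapsing line breaks in HTML, so each record is best wrapped in its own markup on output.

```php
<?php
// Hypothetical helpers; the original post does not show its code.
function add_news(string $file, string $title, string $comments): void
{
    // Flatten line breaks and tabs inside the fields so one record stays on one line.
    $clean = fn(string $s): string => str_replace(["\r", "\n", "\t"], ' ', $s);
    $record = $clean($title) . "\t" . $clean($comments) . "\n";
    file_put_contents($file, $record, FILE_APPEND | LOCK_EX);
}

function read_news(string $file): array
{
    $items = [];
    // file() returns the file as an array with one element per line.
    foreach (file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        [$title, $comments] = explode("\t", $line, 2);
        $items[] = ['title' => $title, 'comments' => $comments];
    }
    return $items;
}

// Demo with a temporary file standing in for the real news file.
$file = tempnam(sys_get_temp_dir(), 'news');
add_news($file, 'Opening hours', 'We are open on Sundays.');
add_news($file, 'New menu', 'Try the soup.');
foreach (read_news($file) as $item) {
    echo '<h3>', htmlspecialchars($item['title']), '</h3>',
         '<p>', htmlspecialchars($item['comments']), "</p>\n";
}
unlink($file);
```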

RTF To Plain Text Conversion
does anyone know of a good PHP "module" -- or something else that I can invoke from a PHP script -- that will perform a simple conversion from Rich Text Format (RTF) to plain text with line breaks? I want to store some data in a MySQL database in RTF and allow users to preview the data as unformatted text (except for line breaks/paragraphs) on a webpage before deciding whether to download a file containing the RTF data. I'd rather not try to hack something out myself if I don't have to. The RTF files are likely to be created with Microsoft Word.

Convert MS Word / Rtf / ... To Plain Text
i'm looking for standalone libraries that convert documents to plain text so i can let people edit the text in a textarea after uploading. One thing to notice is that i can not use COM because i can't configure the webserver.

Does anyone have interesting classes that are able to do this? I found a PHP class for MS Word documents at http://obninsk.name/obninsk_doc/ but that doesn't work at all for my Word documents.

Mail() Plain Text Vs. Html Format
I have been testing the mail() code below using MS Outlook and Outlook Express and a hotmail account and the details sent are always in "plain text" format, which results in the information being nicely aligned (incidentally the e-mail contains order confirmation with lots of columns).

However, my customer came back to me this morning to tell me that all is not well ! And rightly enough, when I looked at the snapshot he sent me he is receiving it in "html" format. What am I doing wrong ? Keeping in mind that I am a PHP greenhorn ... Can anyone help. Thanks in advance !

$headers = "From: info@somecompany.com\r\n";
$headers.= "X-Sender: <info@somecompany.com>\r\n";
$headers.= "X-Mailer: PHP\r\n";
$headers.= "X-Priority: 1\r\n";
$headers.= "Return-Path: "."<info@somecompany.com>\r\n";
$headers.= "cc: info@anothercompany.com\r\n";
$headers.= "bcc: me@mycompany.com\r\n";
$headers.= "MIME-Version: 1.0\r\n";
$headers.= "Content-type: text/plain; charset=iso-8859-1\r\n";

if(@mail($to,$re,$msg,$headers))
{
// tell them all was sent fine
}
else
{
// give an error message
}

Inserting/parsing Plain Text With 'require'?
I am trying to setup a very simple site that will pull text files into an existing template. I am using a simple require
statement, such as:

<?php
require "/www/companyname/body.txt"
?>

The first problem is that it does not seem to respect the linefeeds, which are saved in Unix format, and just lists it as one
massive block of text. The second problem is that, obviously, it does not convert symbols such as '&' to '&amp;'.

The reason behind this way of including text into HTML files is so that the lecturers can write articles without having to
deal with HTML and the articles are inserted into the HTML templates with the 'require' statement. Also, the sheer
number of text documents that need to be posted would cause a lot of work. I have looked at Project Midgard, but I
tend to shy away from applications with little documentation, even though that would be absolutely ideal.
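
A possible way around both problems, sketched under the assumption that the text is read as a string instead of being pulled in with require: escape the special characters with htmlspecialchars, then turn line breaks into `<br />` with nl2br. The helper name text_to_html is invented for this example.

```php
<?php
// Hypothetical helper: turns plain text into HTML-safe markup.
function text_to_html(string $text): string
{
    // Normalise Windows and old Mac line endings to "\n" first.
    $text = str_replace(["\r\n", "\r"], "\n", $text);
    // Escape &, <, > and quotes, then insert <br /> before each newline.
    return nl2br(htmlspecialchars($text, ENT_QUOTES, 'UTF-8'));
}

echo text_to_html("Fish & chips\nserved <hot>"), "\n";
```

In the template this would be used as `echo text_to_html(file_get_contents("/www/companyname/body.txt"));`, so the file is treated as data and never parsed as PHP.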

Sending Both HTML And Plain Text Email.
I am using php to send weekly newsletters to my mysql database, the emails are always in HTML only.

I was wondering if anyone knew how to send both types so that if they can't view HTML emails it will show just text?
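
One common answer is a multipart/alternative message: the same content is sent twice, plain text first and HTML last, and the mail client shows the last part it can display. The sketch below only builds the headers and body; the function name and the example address are placeholders, not something from the thread.

```php
<?php
// Hypothetical helper: builds headers and body for a multipart/alternative message.
function build_alternative_mail(string $text, string $html): array
{
    // A random boundary string that must not appear inside either part.
    $boundary = 'b_' . bin2hex(random_bytes(16));

    $headers = "MIME-Version: 1.0\r\n"
             . "Content-Type: multipart/alternative; boundary=\"$boundary\"";

    // Parts go in increasing order of preference: plain text, then HTML.
    $body = "--$boundary\r\n"
          . "Content-Type: text/plain; charset=UTF-8\r\n"
          . "Content-Transfer-Encoding: 8bit\r\n\r\n"
          . $text . "\r\n"
          . "--$boundary\r\n"
          . "Content-Type: text/html; charset=UTF-8\r\n"
          . "Content-Transfer-Encoding: 8bit\r\n\r\n"
          . $html . "\r\n"
          . "--$boundary--\r\n";

    return [$headers, $body];
}

[$headers, $body] = build_alternative_mail("Hello,\r\nthis week's news...", "<p>Hello,<br>this week's news...</p>");
// mail('someone@example.com', 'Newsletter', $body, $headers); // placeholder address
echo $headers, "\r\n\r\n", $body;
```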

Plain Text Email Spacing Issues?
PHP Code:

// create final message, $text refers to the textarea they typed the original message in
$message2 = "Dear $firstname,

$text

Regards,
The Team......

Application/octet-stream Vs Text/plain When Uploading
When testing Zend's file upload script, I uploaded a file (sql.txt, a SQL backup) and $_FILES reported it as text/plain, as it should. As soon as I renamed it to sql.sql, $_FILES reported it as application/octet-stream. All I did was rename the file. To make matters worse, I thought Windows XP Pro was adding some bits to the file to explain the new type in $_FILES, so I renamed sql.sql to sql.exe; $_FILES now says it is text/plain, even with the .exe extension.

Does anyone know why adding the .sql extension would change the type from text/plain to application/octet-stream?

BTW, the web host is a Linux box (RH).

More tested extensions (renaming sql.txt to the following extensions):
.sql - application/octet-stream
.php - application/octet-stream
.gz - application/octet-stream
.tar - application/octet-stream
.htm - text/plain
.html - text/plain
.txt - text/plain
.exe - text/plain
.gif - text/plain
.jpg - text/plain
.asp - text/asp
.rpm - audio/x-pn-realaudio-plugin
.wav - audio/x-wav
.mp3 - audio/mpeg

(All I'm doing is changing the extension, nothing more. Opening the file in Notepad shows all ASCII, no funky characters.)

Phpmailer.class Messages Are Being Converted To Plain Text...
I am using a phpmailer class to send some stuff over email...

I am trying to send it as text/html, but for some reason the emails are
being converted to plain text and all the headers are shown. Here is the
email...

X-Tour4Less.co.il Mailer:
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="b1_27976937fb6a931b3ed2d40aebd76a26"

--b1_27976937fb6a931b3ed2d40aebd76a26
Content-Type: text/plain; charset = "windows-1255"
Content-Transfer-Encoding: 8bit

ניסיון עברית

--b1_27976937fb6a931b3ed2d40aebd76a26
Content-Type: text/html; charset = "windows-1255"
Content-Transfer-Encoding: 8bit

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html>
<head>

<META HTTP-EQUIV="Content-Type" content="text/html;
charset=windows-1255"></head><body><p dir=RTL><span
lang=HE>&#1513;&#1500;&#1493;&#1501;&nbsp;&#1495;&#1489;&#1512;&#1514;
Tour4less contact .</span></p><p dir=RTL><span
lang=HE>&#1502;&#1513;&#1514;&#1502;&#1513; &#1513;&#1500;
&#1492;&#1488;&#1514;&#1512; &#1513;&#1500;&#1504;&#1493; </span><span
dir=LTR>TOUR4LESS.CO.IL</span><span dir=RTL></span><span lang=HE><span
dir=RTL></span>&#1492;&#1514;&#1506;&#1504;&#1497;&#1497;&#1503;
&#1489;&#1497;&#1510;&#1497;&#1512;&#1514; &#1511;&#1513;&#1512;
&#1488;&#1497;&#1514;&#1493; &#1506;&quot;&#1497;.........

How To Send A Plain Text Version Of An Email With Html
How can you send a plain text version of an email with the HTML so that
the user's mail client can access this plain text version?

Sending Plain Text E-mail, Trying To Track Accesses
I am using PHP to dispatch mail from time to time using the mail() function. I have the message split into an HTML part and a plain text part; only one displays, depending on the recipient's mail client. My HTML message includes a 1x1 "image" that is really a PHP script, which allows me to track reads of the HTML version... but I don't know of a way to track reads/views/accesses/etc. of the plain text version. Is this possible?

Best Way To Extract Text From Website
I am trying to extract the text from about 1500 pages so I can dump the info into a database, but I would like to be able to keep the <br> and formatting around the text, but strip out all of the rest of the pages code.

Extract Text From Word Documents
Is there a way to extract the text of a word document with php? And perhaps some of the formatting (like break lines, bold, italic,...)?

Mime_content_type() For PNG Image Returns "text/plain"
PHP 4.3.8 with UNIX with option --with-magic_mime

Code: ( php )

How Can I Extract Text From An Image (jpeg, Tiff, etc.)?
How can I extract text from an image (jpeg, tiff, bmp, etc.) using PHP?
Is it possible? Which class should I use?

PHP4 : Extract Text From HTML File
I would like to extract the text in an HTML file.
For the moment, I'm trying to get all text between <td> and </td>. I
used a regular expression because I don't know the format between
<td> and </td>.

It can be:
<td> text1 </td>
or
<td>
text1
</td>
or anything else

eregi("<td(.*)>(.*)(</td>?)",$text,$regtext);

The problem is that, if I have
<td> text</td>
<td>text2</td>

regtext will return text</td><td>text2.

How can I change the expression so that it stops at the first occurrence
of </td>?
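
A sketch of one way to make the match stop at the first </td>, assuming the move from eregi to the preg functions is acceptable: use a lazy group `(.*?)` and collect every cell with preg_match_all. The sample input is invented to mirror the cases listed in the question.

```php
<?php
// Sample input, made up to cover the layouts described in the question.
$text = "<td> text</td>\n<td>text2</td>\n<td class=\"x\">\ntext3\n</td>";

// [^>]* allows attributes inside the opening tag.
// (.*?) is lazy, so it stops at the first </td> it meets.
// Flags: i = case-insensitive, s = dot also matches newlines.
preg_match_all('~<td\b[^>]*>(.*?)</td>~is', $text, $matches);

// $matches[1] holds the captured cell contents; trim the surrounding whitespace.
$cells = array_map('trim', $matches[1]);
print_r($cells);
```

For anything beyond simple, well-formed tables (nested tables, unclosed cells), a real HTML parser is likely to be more robust than a regular expression.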


How To Extract An Email-address From A Text File
Can somebody show me a quick code snippet to reliably extract an
email address from a text file?

Q: Extract Only Text Lines Visible In Webbrowser?
I need to run a web query for further processing, e.g.

http://moneycentral.msn.com/investo...YMBOL=F,MSFT,DE

I use

<?php
$MSN="http://moneycentral.msn.com/investor/external/excel/quotes.asp?SYMBOL=";
$QSymbols="F,MSFT,DE";
$QStr="$MSN"."$QSymbols";
$lines = file ($QStr);
?>

However, the relevant information is only 3 lines on the webpage, but
more than 1000 lines in $lines, so it is slow and tedious to work
with. Is there any way to extract only the lines visible in the
browser?

Extract Data From A Web Page?
with PHP, can somebody direct me or give me some insight on how to go about extracting data from a web page.

Say I want to pull sports statistics from a page.. how is this done?

Extract Specific Parts Of A Page
I have a JS script that runs from a remote location. It displays the latest lottery results, but there are ads on it. I would like to extract the gifs/jpegs from it. Can it be done, and if so, how?

Php Script To Filter A Text File And Extract Lines Starting With Keyword?
For a class, students are going to run an experiment on line. Each time
a subject runs, his/her data is appended to one giant text file. Their
own data set will be just one line starting with the keyword they gave
as identification.

The faculty does not want the students to be able to download and see
the giant data file. He wants the students to only download and see the
data that starts with their own identification tag.

in unix, filtering a file to keep only the line starting with code MCB
would look something like
tail -f your_file_name | grep MCB
from what I read.

Given the concerns the faculty has for protecting the database, what do
I need to look into to write a PHP script that would access the data
file, but only show a web page with the data corresponding to the
student's identification code?
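
A rough PHP equivalent of the grep filter, as a sketch only: read the file line by line and keep the lines that begin with the student's ID. The function name, the sample data, and the assumption that the ID is followed by a space are all invented for illustration.

```php
<?php
// Hypothetical helper: returns only the lines that start with the given ID.
function lines_for_id(string $path, string $id): array
{
    $found = [];
    $handle = fopen($path, 'r');
    if ($handle === false) {
        return $found;
    }
    while (($line = fgets($handle)) !== false) {
        // strncmp compares only the first strlen($id) bytes: a prefix test.
        if (strncmp($line, $id, strlen($id)) === 0) {
            $found[] = rtrim($line, "\r\n");
        }
    }
    fclose($handle);
    return $found;
}

// Demo with a temporary file standing in for the real data file.
$path = tempnam(sys_get_temp_dir(), 'data');
file_put_contents($path, "MCB 12 34 56\nXYZ 98 76 54\nMCB2 11 22 33\n");

// The trailing space keeps "MCB" from also matching "MCB2".
print_r(lines_for_id($path, 'MCB '));
unlink($path);
```

Keeping the data file outside the web server's document root would stop students from downloading it directly, so the script becomes the only way in.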

How To Extract A Page Title From An HTML File
I am trying to extract the page title, description and keywords from an HTML page. Description and Keywords are easy, using get_meta_tags().

Getting The Text From A Web Page
I need to take a web page and parse the text from the site. I am thinking of using regex to kill the script and style sections, then strip tags or more regex to kill the rest of the non-text. I need to have the text categorized as sentences, so I somehow have to keep track of how it is grouped with other text.

I have two questions:

1) What do you guys think is the best way to do this? Regex, explodes, iteration, building a DOM structure, callbacks?

2) What would the regex look like if I wanted to find <, then any number of whitespace characters, then the word script, then anything except for (<, possible whitespace, /script, and then anything until >)? Put another way, how can I match the style section of a document?

Plain English GD Installation?
I finally have php3, apache and MySQL installed and running. I would like to install or get GD working.

I am on Win98, php3, apache 1.3.9 and downloaded GD 1.8.
I looked over the readme, but it doesn't make a lot of sense to me. Can anyone point me in the direction of, say, a "GD install for dummies"?

Q: Read Text With VBA From A PHP Web-page
I created a page on our intranet that shows a number and that
increases for every time the page is opened. It is similar to a
visitors-counter.
When I look at the page with Internet Explorer it works just fine.

Now I want to read this web-page from a MS-Word macro and include the
number as a company wide unique id in my MS-Word document.
Unfortunately, the PHP script doesn't update the counter when I call
it from my MS-Word macro.

How can I force PHP to update my counter when I call it from a VBA
macro?

I am using the following code:

Finding Text Within A Page
Basically I want to search a page for words and if they are found I want the script to do one thing. If they're not found, I want it to do something else.

Plain PHP Implementation Of Hash Function
I have a problem compiling the hash function from PECL into my PHP.

I get the error configure: error: C preprocessor "/lib/cpp" fails
sanity check

I would like to use a plain PHP implementation of these functions.

Is there a library of them around?

Page Breaks In Text Reports
How can I make page breaks in .txt reports? Is this even possible with php?

How To Place Formatted Text On The Php Page
I'm entering data through a textarea, and I want to display the same data on the
form with all its formatting. That is, if I pressed Enter while typing in the textarea to start a new line, the new line should also appear when the text is displayed on the form.

Changing Text On Page Via Dropdown Box.
I have a webpage that draws various text descriptions from a MySQL database. The description displayed on the page is controlled via a dropdown box. Currently I am using Javascript with a dropdown box (onChange handler) to change the value of a textbox.

Can anybody think of a way that I don't have to use a textbox (with its scrollbars and background)? How can I change the value of a PHP variable by changing the selected index of a dropdown box?

Display A Text File On A Web Page
I have a text file that I am trying to display on a web page. If I cat
or more the file it formats and displays fine. When it comes up in the
browser it seems to lose tabs and the format gets messed up. This is
how I display the file.

$show = file("./fields/combined/$cdp$store");
$arrayitems = sizeof($show);
$x=0;
while ($x < $arrayitems) {
print("$show[$x]\n");
$x++;
}

If I edit the file it has ^M at the end of each line if it matters.
Does anyone have a better idea as to how to display it?
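
One option, sketched with an invented helper name: browsers collapse tabs and line breaks in ordinary HTML, but keep them inside a `<pre>` element. The carriage returns (the ^M characters) can be dropped on the way, and htmlspecialchars keeps characters such as & and < from being read as markup.

```php
<?php
// Hypothetical helper: wraps file text in <pre> so tabs and line breaks survive.
function file_text_to_pre(string $contents): string
{
    // Drop the carriage returns (shown as ^M in some editors) from CRLF line endings.
    $contents = str_replace("\r", '', $contents);
    return '<pre>' . htmlspecialchars($contents, ENT_QUOTES, 'UTF-8') . '</pre>';
}

// Sample text standing in for file_get_contents("./fields/combined/$cdp$store").
echo file_text_to_pre("item\tqty\r\nnuts & bolts\t12\r\n");
```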

Displaying Text While Page Is Loading
i'm trying to display text while a page is loading using a method
similar to the following:

<?
ob_end_flush();
echo 'AAA<br>';
flush();
sleep(10);
echo 'BBB'
?>

in this script, AAA and BBB appear at the same time - when the page has
fully loaded - which is not what i want (i want AAA to appear and then
10 seconds later, BBB to appear). pursuant to the suggestions on
php.net's entry for flush, i've also tried the following to no avail:

<?
echo 'AAA<br>';
ob_flush();
flush();
sleep(10);
echo 'BBB'
?>



Text File Download On .php Page
I have created a website in which I want to put a link to download a text
file. When I used simple:

<a href="Dir/File.txt">Download file</a>

The problem was that instead of being downloaded, the file contents were
displayed in the browser. In some book on PHP5 I found a "solution": create a...

Take A Text Which Is Highlighted On A Web Page As A String
how can i use php to take a text which is highlighted on a web page as a string?

Phpmailer Class Converted To Plain On Some Servers...
I am using a phpmailer class to send some forms over the email...

And the problem is that some people (the buyer is especially problematic
for me...) are getting the email as raw data (source...). Here is the email
itself as they get it (the headers are below...)

Code:

X-Tour4Less.co.il Mailer:
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="b1_27976937fb6a931b3ed2d40aebd76a26"

--b1_27976937fb6a931b3ed2d40aebd76a26
Content-Type: text/plain; charset = "windows-1255"
Content-Transfer-Encoding: 8bit

ניסיון עברית

--b1_27976937fb6a931b3ed2d40aebd76a26
Content-Type: text/html; charset = "windows-1255"
Content-Transfer-Encoding: 8bit

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html>
<head>

and the rest of the email...

and here is the headers...

Code:

ESMTP; 01 Sep 2006 12:11:23 -0000
Received: (qmail 32712 invoked from network); 1 Sep 2006 05:11:23 -0700

Received: from localhost (HELO http://www.tour4less.co.il) (127.0.0.1)
by localhost with SMTP; 1 Sep 2006 05:11:23 -0700
Received: from phpmailer ([88.153.9.8])
by http://www.tour4less.co.il with HTTP (PHPMailer);
Fri, 1 Sep 2006 05:11:23 -0700
Date: Fri, 1 Sep 2006 05:11:23 -0700
To: undisclosed-recipients:;
From: "Tour4less.co.il" <########## // here was an email I deleted...

Subject: Contact Email from Tour4Less.co.il
Message-ID: <27976937fb6a931b3ed2d40aebd76a26@www.tour4less.co. il>
X-Priority: 3
X-Mailer: PHPMailer [version 1.71]
X-Virus-Scanned: amavisd-new at sce.ac.il
Return-Path: ############## // here was an email I deleted...
X-OriginalArrivalTime: 01 Sep 2006 12:11:02.0551 (UTC)
FILETIME=[B168A270:01C6CDBF]

Creating Text File With Page Breaks
If \n is a newline and \r is a carriage return, what special character is used for a page break?
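
The usual candidate is the ASCII form feed character (byte 12), written "\f" in a double-quoted PHP string. Whether it actually starts a new page depends on the printer or viewer, so treat the sketch below as something to test against the target device; the page contents are made up.

```php
<?php
// "\f" is the ASCII form feed (byte 12), conventionally used as a page break in plain text.
$pages = [
    "Report, page 1\nfirst line\n",
    "Report, page 2\nsecond line\n",
];

// Put a form feed between pages, not after the last one.
$report = implode("\f", $pages);

$path = tempnam(sys_get_temp_dir(), 'report');
file_put_contents($path, $report);
echo "Wrote ", strlen($report), " bytes to $path\n";
```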

I Want To Create Web Page Acting As Text Editor?
What I am trying to do is create a tutorial on JavaScript and PHP for my beginning students, using a very simple online editor like the one at w3schools, so the left window will be the text file code, such as:

<html>
<body>
<script>document.write("This is a test.");</script>
</body>
</html>...

Getting Query Values Into Text Box, Then Pass Them To Another Page
How do I get the values from a DB query into text boxes in a form, so I can hit submit and it goes to an "update" page? Code:

Paste HTML Page Inside Text Area
i want to create a newsletter or emailer form sender. and that form has a text area. can be an ordinary text area or can also be tinymce.

I want to have the text area set up in such a way that an HTML page can be pasted into it (let's say the HTML page is jobpost.html) and sent to the email address that I specify.

How To Populate Data Into Text Fields On Page Load??
I am creating an account management page that shows the user's contact information and I would like to populate the textfields so that the user doesn't have to type in all his information when he wants to change something. I want to have him type in only the information he wants to change or update.

I already have the data displayed on the page. But now I would like to take the data and fill in the textfields with it.

Clearing The Input Text Field After Submiting The Form To A New Page
I have a page with a form with an input text field ... and when a user types his/her ID ... and submits the form - his/her page opens in a NEW window .... but the value (ID) that was written in the original window remains.

I want to make sure that the form input field on the original page becomes "invisible" - by refreshing/reloading the original page or just by clearing the input form after submission.

For now I have put a meta refresh tag in the head of the html file ... but that is not the solution ... because I only need a one-time refresh/reload of the page ... and this is to happen right after the input text value is submitted.

How To Change Template Menu Link Text Color For Page Currently Viewed ??
Say you have a left menu that stays the same on every page within your site. Obviously you would make this menu a template or library item, so that when you edit one version, all the others are updated.

Let's say the left menu is the following:

Page_1
Page_2
Page_3

Now, when I am viewing Page_1, I want the "Page_1" text in the left menu to be red, so that the viewer knows what page they are on. When I click on the left menu link for "Page_2", then I want the "Page_1" text to go back to black and the "Page_2" text to be red.

Write Text To An "image-page"
If I generate an image, is it possible to write text on the same page then? If it isn't so, there's no special meaning to generate images, who wants to surf on a page that is out of content, right?

<?
Header("Content-Type: image/gif");
$img = ImageCreate(100, 100);
$black = ImageColorAllocate($img, 0, 0, 0); // was $im, an undefined variable
ImageFill($img, 0, 0, $black); // start inside the 100x100 canvas (valid coordinates are 0-99)
ImageGIF($img);
?>

Extract Tag Value
I have a 3GB XML file that I need to parse. Since the file is too big to read all at once into an array, I am reading it one line at a time. I need help on how to extract the contents of some tags, say name, address etc. Here is the code: ....

Extract From String
It's Saturday morning and my mind has gone blank! How do I extract something from a string between two tags?

For example,

$line = "Hello <1234567> This is test";

I want to get everything in between the '<' and '>' characters into a variable, so that $var = 1234567;

Can someone please point me in the right direction or tell me what to search for!
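
The thing to search for is probably preg_match with a capture group. A minimal sketch using the string from the question:

```php
<?php
$line = "Hello <1234567> This is test";

// [^>]* matches everything up to, but not including, the next '>'.
// The parentheses capture that part into $m[1].
if (preg_match('/<([^>]*)>/', $line, $m)) {
    $var = $m[1];
    echo $var, "\n"; // prints 1234567
}
```

preg_match stops at the first match; preg_match_all would collect every bracketed value if the line contains more than one.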

Extract Function
I am now working on a method to extract the unique record
IDs from the search query output. I am doing this because what I would
like to do is build a hyperlink to the details page. I am going to
write a separate query to pull up the record details. I was wondering if
I could pull this information out using the Extract function?


Copyright 2005-08 www.BigResource.com, All rights reserved