Extract Only Plain Text From A Page
Basically, I am trying to write some PHP code that takes the text from any web page and strips out all the HTML, CSS, and JavaScript code and formatting, leaving only the plain text. I have my code started, but I have hit a snag with the JavaScript and CSS. This is what I have so far:
<?php
$geturl = $_GET["url"];
ob_start();
include($geturl);
$page = ob_get_contents();
ob_end_clean();
$output = ereg_replace('<script.*.</script>', ' ', $page);
$output2 = ereg_replace('<style.*.</style>', ' ', $output);
$plaintext = strip_tags($output2);
echo $plaintext;
?>
The strip_tags function removes all HTML tags, but it keeps whatever sits between an opening tag and its closing tag. That is fine for ordinary HTML, where the content between the tags is text, but for JavaScript and CSS the content between the tags is code, so the code is left behind. For more clarification:
html:
<div name="htmltag">Keep this text here</div>
javascript:
<script>function somejs() {remove all this code}</script>
As you can see, the text between the div tags should stay, but the js between the script tags should be removed because it is code.
I then tried the ereg_replace function to get rid of the JavaScript and CSS code, but there is a problem when the page contains more than one script or style block. The wildcard (.*.) is greedy: it skips over every closing script or style tag until it reaches the last one, so all the text between the two pieces of code is deleted as well. Example:
<SCRIPT>function somejs() {remove all this code}</script> //removes all text and code from beginning here
KEEP ALL THIS TEXT HERE
<script>function somejs() {remove all this code}</SCRIPT> //to end here
Now, finally, the question: is there any way to remove only the JavaScript and CSS code between an opening tag and the very next closing tag? Or is there any other way to get rid of the JavaScript and CSS code?
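One possible way to do this, as a minimal sketch rather than a definitive answer: switch from ereg_replace to preg_replace, which supports lazy quantifiers (.*?) and the i and s modifiers. The function name page_to_plain_text is made up for illustration. The sketch also assumes the page is fetched as a string (for example with file_get_contents) rather than with include(), since include() would execute any PHP found in the fetched content.

```php
<?php
// Hypothetical helper (not from the original post): drop <script> and <style>
// blocks first, then strip the remaining tags.
function page_to_plain_text(string $html): string
{
    // "#" delimiters avoid escaping "/"; "i" ignores case; "s" lets "." match newlines.
    // ".*?" is lazy, so each match ends at the first closing tag, not the last one.
    $html = preg_replace('#<script\b[^>]*>.*?</script\s*>#is', ' ', $html);
    $html = preg_replace('#<style\b[^>]*>.*?</style\s*>#is', ' ', $html);

    // Remove the remaining tags and collapse runs of whitespace.
    $text = strip_tags($html);
    return trim(preg_replace('/\s+/', ' ', $text));
}

// The example from the question: two script blocks with text in between.
$sample = "<SCRIPT>function a() {}</script>\nKEEP ALL THIS TEXT HERE\n<script>function b() {}</SCRIPT>";
echo page_to_plain_text($sample), "\n"; // prints: KEEP ALL THIS TEXT HERE
```

A regular expression like this is only an approximation of HTML parsing (a script that contains the literal string "</script>" inside a JavaScript string would end the match early), so a DOM parser may be the safer choice for arbitrary pages.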
See Related Forum Messages: Follow the Links Below to View Complete Thread
Plain Text Email
I want to protect all inputs for sending a plain text email, in a common routine. I have just found POSIX [:print:], which I thought looked useful. I didn't want to use htmlentities() because it's a plain text email. Would this protect me from anyone sending spam through this?
$raw = stripslashes($raw);
$raw = preg_replace("/(content-type|bcc:|cc:|onload|onclick)/i", "DELETED", $raw);
$raw = strip_tags($raw);
$raw = preg_replace("/[^[:print:]]/", " ", $raw);
$raw = substr($raw, 0, 500);
$raw = trim($raw);
Or should I use:
$raw = htmlentities($raw, ENT_NOQUOTES);
The email address would obviously be handled differently. This would cover just the name, subject and message. I don't need newlines etc.
Plain Text Database
I'm really a newbie to PHP but not to OOP. I'm designing a database to hold simple text messages to display on a page called "News". The client doesn't want an SQL database, so I suggested a plain text database. I have it working, but when I pull the data (fopen) it all comes back as one line. It's set up as a simple form passing 2 variables, $title and $comments. They both write (fwrite) just fine to the .txt file, but upon retrieving them (fopen) it's all one line. Since I can't pass formatted text to a .txt file, is there a different way? As a newbie I haven't come across a solution yet. The client wants this soon, so I'm asking here due to the timeline. Given a few more weeks I'm sure I'd stumble across it in some text.
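A minimal sketch of one way this could work, assuming the records come back as one line because nothing separates them in the file. The format (one record per line, title and comments separated by a tab) and the function names save_news and load_news are assumptions made for illustration, not something from the original post.

```php
<?php
// Hypothetical storage format: one record per line, title and comments
// separated by a tab character.
function save_news(string $file, string $title, string $comments): void
{
    // Flatten tabs and line breaks inside the fields so each record stays on one line.
    $clean = fn(string $s): string => trim(preg_replace('/[\t\r\n]+/', ' ', $s));
    $line = $clean($title) . "\t" . $clean($comments) . "\n";
    // FILE_APPEND adds to the end of the file; LOCK_EX guards against concurrent writes.
    file_put_contents($file, $line, FILE_APPEND | LOCK_EX);
}

function load_news(string $file): array
{
    $items = [];
    // file() returns one array element per line; the flags drop newlines and empty lines.
    foreach (file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        // The "+ [1 => '']" supplies an empty comments field if a line has no tab.
        [$title, $comments] = explode("\t", $line, 2) + [1 => ''];
        $items[] = ['title' => $title, 'comments' => $comments];
    }
    return $items;
}
```

When the records are printed into an HTML page, each one still needs its own markup (for example a paragraph per record), because a browser collapses plain newlines.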
RTF To Plain Text Conversion
does anyone know of a good PHP "module" -- or something else that I can invoke from a PHP script -- that will perform a simple conversion from Rich Text Format (RTF) to plain text with line breaks? I want to store some data in a MySQL database in RTF and allow users to preview the data as unformatted text (except for line breaks/paragraphs) on a webpage before deciding whether to download a file containing the RTF data. I'd rather not try to hack something out myself if I don't have to. The RTF files are likely to be created with Microsoft Word.
Convert MS Word / Rtf / ... To Plain Text
I'm looking for standalone libraries that convert documents to plain text so I can let people edit the text in a textarea after uploading. One thing to note is that I cannot use COM because I can't configure the webserver. Does anyone have any interesting classes that are able to do this? I found a PHP class for MS Word documents at http://obninsk.name/obninsk_doc/ but that doesn't work at all for my Word documents.
Mail() Plain Text Vs. Html Format
I have been testing the mail() code below using MS Outlook, Outlook Express and a Hotmail account, and the details sent are always in "plain text" format, which results in the information being nicely aligned (incidentally, the e-mail contains an order confirmation with lots of columns). However, my customer came back to me this morning to tell me that all is not well! And sure enough, when I looked at the snapshot he sent me, he is receiving it in "html" format. What am I doing wrong? Keep in mind that I am a PHP greenhorn. Can anyone help? Thanks in advance!
$headers = "From: info@somecompany.com ";
$headers.= "X-Sender: <info@somecompany.com> ";
$headers.= "X-Mailer: PHP ";
$headers.= "X-Priority: 1 ";
$headers.= "Return-Path: "."<info@somecompany.com> ";
$headers.= "cc: info@anothercompany.com ";
$headers.= "bcc: me@mycompany.com ";
$headers.= "MIME-Version: 1.0 ";
$headers.= "Content-type: text/plain; charset=iso-8859-1 ";
if(@mail($to,$re,$msg,$headers)) {
// tell them all was sent fine
} else {
// give an error message
}
Inserting/parsing Plain Text With 'require'?
I am trying to set up a very simple site that will pull text files into an existing template. I am using a simple require statement, such as: <?php require "/www/companyname/body.txt" ?> The first problem is that it does not seem to respect the linefeeds, which are saved in Unix format, and just lists the text as one massive block. The second problem is that, obviously, it does not convert symbols such as '&' to '&amp;'. The reason for including text in HTML files this way is so that the lecturers can write articles without having to deal with HTML, and the articles are inserted into the HTML templates with the 'require' statement. Also, the sheer number of text documents that need to be posted would cause a lot of work. I have looked at Project Midguard, but I tend to shy away from applications with little documentation, even though that would be absolutely ideal.
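A small sketch of one way to handle both problems, assuming the files are UTF-8 plain text: read the file as a string instead of using require, escape the special characters with htmlspecialchars(), and turn the linefeeds into <br /> tags with nl2br(). The helper name text_to_html is hypothetical.

```php
<?php
// Hypothetical helper: make plain text safe for HTML and keep its line breaks.
function text_to_html(string $text): string
{
    // Escape &, <, > and quotes first, then insert <br /> before each newline.
    return nl2br(htmlspecialchars($text, ENT_QUOTES, 'UTF-8'));
}

// Usage with the path from the question (reads the file instead of requiring it):
// echo text_to_html(file_get_contents('/www/companyname/body.txt'));
```

Reading the file with file_get_contents() also means that any "<?php" sequence a lecturer happens to type is shown as text instead of being executed, which is what require would do.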
Sending Both HTML And Plain Text Email.
I am using PHP to send weekly newsletters to the addresses in my MySQL database; the emails are always in HTML only. I was wondering if anyone knew how to send both types, so that if a recipient can't view HTML emails it will show just the text?
Plain Text Email Spacing Issues?
PHP Code: // create final message, $text refers to the textarea they typed the original message in $message2 = "Dear $firstname, $text Regards, The Team......
Application/octet-stream Vs Text/plain When Uploading
When testing Zend's file upload script, I uploaded a file (sql.txt, which was an SQL backup) and $_FILES reported it as text/plain, as it should. As soon as I renamed it to sql.sql, $_FILES reported it as application/octet-stream. All I did was rename the file. To make matters worse, I thought Windows XP Pro was adding some bits to the file to explain the new type in $_FILES, so I renamed sql.sql to sql.exe; $_FILES now says it is text/plain, even with the .exe extension. Does anyone know why adding the .sql extension would change the type from text/plain to application/octet-stream? BTW, the webhost is a Linux box (RH). More tested extensions (renaming sql.txt to the following extensions):
.sql - application/octet-stream
.php - application/octet-stream
.gz - application/octet-stream
.tar - application/octet-stream
.htm - text/plain
.html - text/plain
.txt - text/plain
.exe - text/plain
.gif - text/plain
.jpg - text/plain
.asp - text/asp
.rpm - audio/x-pn-realaudio-plugin
.wav - audio/x-wav
.mp3 - audio/mpeg
(All I'm doing is changing the extension, nothing more. Opening the file in Notepad, it looks all ASCII, no funky characters.)
Phpmailer.class Messages Are Being Converted To Plain Text...
I am using a phpmailer class to send some stuff by email. I am trying to send it as text/html, but for some reason the emails are being converted to plain text and all the headers are shown. Here is the email:
X-Tour4Less.co.il Mailer:
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="b1_27976937fb6a931b3ed2d40aebd76a26"
--b1_27976937fb6a931b3ed2d40aebd76a26
Content-Type: text/plain; charset = "windows-1255"
Content-Transfer-Encoding: 8bit
ניסיון עברית
--b1_27976937fb6a931b3ed2d40aebd76a26
Content-Type: text/html; charset = "windows-1255"
Content-Transfer-Encoding: 8bit
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html> <head> <META HTTP-EQUIV="Content-Type" content="text/html; charset=windows-1255"></head><body><p dir=RTL><span lang=HE>שלום חברת Tour4less contact .</span></p><p dir=RTL><span lang=HE>משתמש של האתר שלנו </span><span dir=LTR>TOUR4LESS.CO.IL</span><span dir=RTL></span><span lang=HE><span dir=RTL></span>התעניין ביצירת קשר איתו ע"י.........
Sending Plain Text E-mail, Trying To Track Accesses
I am using PHP to dispatch emails from time to time using the mail() function. I have the message split into an HTML form and a plain text form; only one displays, depending on the recipient's mail client. My HTML message includes a 1x1 "image" that is really a PHP script, which allows me to track reads of the HTML version, but I don't know of a way to track reads/views/accesses/etc. of the plain text version. Is this possible?
Best Way To Extract Text From Website
I am trying to extract the text from about 1500 pages so I can dump the info into a database. I would like to keep the <br> tags and formatting around the text, but strip out all the rest of each page's code.
Extract Text From Word Documents
Is there a way to extract the text of a Word document with PHP? And perhaps some of the formatting (like line breaks, bold, italic, ...)?
PHP4 : Extract Text From HTML File
I would like to extract the text in an HTML file. For the moment, I'm trying to get all the text between <td> and </td>. I used a regular expression because I don't know the format between <td> and </td>. It can be:
<td>text1 </td>
or
<td> text1 </td>
or anything else.
eregi("<td(.*)>(.*)(</td>?)",$text,$regtext);
The problem is that, if I have
<td>text</td> <td>text2</td>
regtext will return text</td><td>text2. How can I change the expression so that it stops at the first occurrence of </td>?
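A sketch of one possible fix: the POSIX eregi() function has no lazy quantifiers, but the PCRE functions do, so preg_match_all() with (.*?) stops at the first </td>. The helper name td_contents is made up for this example.

```php
<?php
// Hypothetical helper: return the contents of every <td> cell in an HTML fragment.
function td_contents(string $html): array
{
    // [^>]* skips any attributes on the opening tag.
    // (.*?) is lazy, so each match ends at the first </td> that follows.
    // "i" ignores case; "s" lets "." match newlines inside a cell.
    preg_match_all('#<td\b[^>]*>(.*?)</td>#is', $html, $matches);
    return array_map('trim', $matches[1]);
}

// Usage: td_contents('<td>text</td> <td class="x"> text2 </td>')
// gives one array element per cell.
```

Nested tables would still confuse this pattern, so it is only a reasonable fit for flat tables.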
Q: Extract Only Text Lines Visible In Webbrowser?
I need to run a web query for further processing, e.g. http://moneycentral.msn.com/investo...YMBOL=F,MSFT,DE I use
<?php
$MSN="http://moneycentral.msn.com/investor/external/excel/quotes.asp?SYMBOL=";
$QSymbols="F,MSFT,DE";
$QStr="$MSN"."$QSymbols";
$lines = file ($QStr);
?>
However, the relevant information is only 3 lines on the web page, but more than 1000 lines in $lines, so it is slow and tedious to work with. Is there any way to extract only the lines visible in the browser?
Extract Data From A Web Page?
with PHP, can somebody direct me or give me some insight on how to go about extracting data from a web page. Say I want to pull sports statistics from a page.. how is this done?
Extract Specific Parts Of A Page
I have a JS script that runs from a remote location. It displays the latest lottery results, but there are ads on it. I would like to extract the GIFs/JPEGs from it. Can it be done, and if it can, how?
Php Script To Filter A Text File And Extract Lines Starting With Keyword?
For a class, students are going to run an experiment online. Each time a subject runs, his/her data is appended to one giant text file. Each student's own data set will be just one line starting with the keyword they gave as identification. The faculty member does not want the students to be able to download and see the giant data file. He wants the students to download and see only the data that starts with their own identification tag. In Unix, filtering a file to keep only the lines starting with the code MCB would look something like tail -f your_file_name | grep MCB, from what I read. Given the concerns the faculty member has about protecting the database, what do I need to look into to write a PHP script that would access the data file, but only show a web page with the data corresponding to the student's identification code?
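A rough sketch of the filtering part, under a few assumptions: the data file is kept outside the web root so it cannot be downloaded directly, each record is one line, and the identification code sits at the very start of the line. The function name lines_for_id is hypothetical.

```php
<?php
// Hypothetical helper: return only the lines of $file that start with $id.
function lines_for_id(string $file, string $id): array
{
    $found = [];
    // An empty code would match every line, so refuse it.
    if ($id === '') {
        return $found;
    }
    $handle = fopen($file, 'r');
    if ($handle === false) {
        return $found;
    }
    // Read line by line so the giant file is never loaded into memory at once.
    while (($line = fgets($handle)) !== false) {
        // strncmp() compares only the first strlen($id) bytes of each line.
        if (strncmp($line, $id, strlen($id)) === 0) {
            $found[] = rtrim($line, "\r\n");
        }
    }
    fclose($handle);
    return $found;
}
```

Note that a plain prefix match means the code MCB would also match a line starting with MCB2, so the real script should probably compare the code plus whatever delimiter follows it in the data file, and escape the lines with htmlspecialchars() before printing them.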
Getting The Text From A Web Page
I need to take a web page and parse the text from the site. I am thinking of using regex to kill the script and style sections, then strip tags or more regex to kill the rest of the non-text. I need to have the text categorized as sentences, so I somehow have to keep track of how it is grouped with other text. I have two questions: 1) What do you guys think is the best way to do this? Regex, explodes, iteration, building a DOM structure, callbacks? 2) What would the regex look like if I wanted to find <, then any number of whitespace characters, then the word script, then anything except for (<, possible whitespace, /script, and then anything until >)? Put another way, how can I match the style section of a document?
Plain English GD Installation?
I finally have php3, apache and MySQL installed and running. I would like to install GD and get it working. I am on Win98, php3, apache 1.3.9 and downloaded GD 1.8. I looked over the readme, but it doesn't make a lot of sense to me. Can anyone point me in the direction of, say, a "GD install for dummies"?
Q: Read Text With VBA From A PHP Web-page
I created a page on our intranet that shows a number and that increases for every time the page is opened. It is similar to a visitors-counter. When I look at the page with Internet Explorer it works just fine. Now I want to read this web-page from a MS-Word macro and include the number as a company wide unique id in my MS-Word document. Unfortunately, the PHP script doesn't update the counter when I call it from my MS-Word macro. How can I force PHP to update my counter when I call it from a VBA macro? I am using the following code:
Finding Text Within A Page
Basically I want to search a page for words and if they are found I want the script to do one thing. If they're not found, I want it to do something else.
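A minimal sketch of one way to do this, assuming the page has already been loaded into a string (for example with file_get_contents). The helper name contains_any is made up for illustration.

```php
<?php
// Hypothetical helper: true if any of the words occurs in the text (case-insensitive).
function contains_any(string $text, array $words): bool
{
    foreach ($words as $word) {
        // stripos() returns false when nothing is found. Position 0 is a valid
        // match, so the result must be compared with !== false.
        if ($word !== '' && stripos($text, $word) !== false) {
            return true;
        }
    }
    return false;
}

// Usage:
// if (contains_any($page, ['lottery', 'results'])) { /* do one thing */ }
// else { /* do something else */ }
```

This is a substring search, so "art" would also be found inside "party"; matching whole words only would need a regular expression with word boundaries.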
Plain PHP Implementation Of Hash Function
I have a problem compiling the hash function from PECL into my PHP. I get the error configure: error: C preprocessor "/lib/cpp" fails sanity check I would like to use a plain PHP implementation of these functions. Is there a library of them around?
How To Place Formatted Text On The Php Page
I'm entering data through a textarea, and I want to display the same data on the form with all its formatting. That is, if I pressed Enter while typing in the textarea to start a new line, the new line should also appear when the text is displayed on the form.
Changing Text On Page Via Dropdown Box.
I have a webpage that draws various text descriptions from a MySQL database. The description displayed on the page is controlled via a dropdown box. Currently I am using JavaScript with a dropdown box (onChange handler) to change the value of a textbox. Can anybody think of a way that I don't have to use a textbox (with its scrollbars and background)? How can I change the value of a PHP variable by changing the selected index of a dropdown box?
Display A Text File On A Web Page
I have a text file that I am trying to display on a web page. If I cat or more the file, it formats and displays fine. When it comes up in the browser, it seems to lose tabs and the format gets messed up. This is how I display the file:
$show = file("./fields/combined/$cdp$store");
$arrayitems = sizeof($show);
$x=0;
while ($x < $arrayitems) {
print("$show[$x] ");
$x++;
}
If I edit the file, it has ^M at the end of each line, if it matters. Does anyone have a better idea as to how to display it?
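One possible approach, sketched under the assumption that the file is plain text with Windows line endings (the ^M characters): wrap the contents in a <pre> element, which tells the browser to keep tabs, spaces and line breaks. The helper name file_as_pre is hypothetical.

```php
<?php
// Hypothetical helper: return a file's contents as a <pre> block for an HTML page.
function file_as_pre(string $path): string
{
    $text = file_get_contents($path);
    if ($text === false) {
        return '';
    }
    // Convert Windows line endings (shown as ^M in some editors) to plain newlines.
    $text = str_replace("\r\n", "\n", $text);
    // Escape <, > and & so the file's contents cannot be read as markup.
    return '<pre>' . htmlspecialchars($text, ENT_QUOTES, 'UTF-8') . '</pre>';
}

// Usage with the path from the question:
// echo file_as_pre("./fields/combined/$cdp$store");
```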
Displaying Text While Page Is Loading
I'm trying to display text while a page is loading, using a method similar to the following:
<?
ob_end_flush();
echo 'AAA<br>';
flush();
sleep(10);
echo 'BBB';
?>
In this script, AAA and BBB appear at the same time, when the page has fully loaded, which is not what I want (I want AAA to appear and then, 10 seconds later, BBB to appear). Following the suggestions on php.net's entry for flush, I've also tried the following, to no avail:
<?
echo 'AAA<br>';
ob_flush();
flush();
sleep(10);
echo 'BBB';
?>
Text File Download On .php Page
I have created a website on which I want to put a link to download a text file. When I used a simple <a href="Dir/File.txt">Download file</a>, I had the problem that instead of the file being downloaded, its contents were displayed in the browser. In some book on PHP5 I found a "solution": create a...
Phpmailer Class Converted To Plain On Some Servers...
I am using a phpmailer class to send some forms by email. The problem is that some people (the buyer is especially problematic for me) are getting the email as raw data (source). Here is the email itself as they get it (the headers are below):
Code:
X-Tour4Less.co.il Mailer:
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="b1_27976937fb6a931b3ed2d40aebd76a26"
--b1_27976937fb6a931b3ed2d40aebd76a26
Content-Type: text/plain; charset = "windows-1255"
Content-Transfer-Encoding: 8bit
ניסיון עברית
--b1_27976937fb6a931b3ed2d40aebd76a26
Content-Type: text/html; charset = "windows-1255"
Content-Transfer-Encoding: 8bit
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html> <head>
and the rest of the email. And here are the headers:
Code:
ESMTP; 01 Sep 2006 12:11:23 -0000
Received: (qmail 32712 invoked from network); 1 Sep 2006 05:11:23 -0700
Received: from localhost (HELO http://www.tour4less.co.il) (127.0.0.1) by localhost with SMTP; 1 Sep 2006 05:11:23 -0700
Received: from phpmailer ([88.153.9.8]) by http://www.tour4less.co.il with HTTP (PHPMailer); Fri, 1 Sep 2006 05:11:23 -0700
Date: Fri, 1 Sep 2006 05:11:23 -0700
To: undisclosed-recipients:;
From: "Tour4less.co.il" <########## // here was an email I deleted...
Subject: Contact Email from Tour4Less.co.il
Message-ID: <27976937fb6a931b3ed2d40aebd76a26@www.tour4less.co.il>
X-Priority: 3
X-Mailer: PHPMailer [version 1.71]
X-Virus-Scanned: amavisd-new at sce.ac.il
Return-Path: ############## // here was an email I deleted...
X-OriginalArrivalTime: 01 Sep 2006 12:11:02.0551 (UTC) FILETIME=[B168A270:01C6CDBF]
I Want To Create Web Page Acting As Text Editor?
What I am trying to do is to create a tutorial for my beginning students for JavaScript and PHP, using a very simple online editor like the one @ w3schools, So the left window will be the text file code, such as: <html> <body> <script>document.write("This is a test.");</script> </body> </html>...
Paste HTML Page Inside Text Area
I want to create a newsletter or emailer form sender, and that form has a text area. It can be an ordinary text area or it can be TinyMCE. I want the text area to work in such a way that an HTML page can be pasted into it (let's say the HTML page is jobpost.html) and sent to the email address that I specify.
How To Populate Data Into Text Fields On Page Load??
I am creating an account management page that shows the user's contact information and I would like to populate the textfields so that the user doesn't have to type in all his information when he wants to change something. I want to have him type in only the information he wants to change or update. I already have the data displayed on the page. But now I would like to take the data and fill in the textfields with it.
Clearing The Input Text Field After Submiting The Form To A New Page
I have a page with a form containing a text input field. When a user types his/her ID and submits the form, his/her page opens in a NEW window, but the value (ID) that was typed in the original window remains. I want to make sure that the value in the input field on the original page becomes "invisible", either by refreshing/reloading the original page or just by clearing the form input after submission. For now I have put a meta refresh tag in the head of the HTML file, but that is not the solution, because I only need a one-time refresh/reload of the page, and it should happen right after the input text value is submitted.
How To Change Template Menu Link Text Color For Page Currently Viewed ??
Say you have a left menu that stays the same on every page within your site. Obviously you would make this menu a template or library item, so that when you edit one version, all the others are updated. Let's say the left menu is the following: Page_1 Page_2 Page_3 Now, when I am viewing Page_1, I want the "Page_1" text in the left menu to be red, so that the viewer knows what page they are on. When I click on the left menu link for "Page_2", then I want the "Page_1" text to go back to black and the "Page_2" text to be red.
Write Text To An "image-page"
If I generate an image, is it possible to write text on the same page as well? If it isn't, there's not much point in generating images; who wants to surf to a page that has no content, right?
<?
Header("Content-Type: image/gif");
$img = ImageCreate(100, 100);
$black = ImageColorAllocate($img, 0, 0, 0);
ImageFill($img, 0, 0, $black);
ImageGIF($img);
?>
Extract Tag Value
I have a 3GB XML file that I need to parse. Since the file is too big to read into an array all at once, I am reading it one line at a time. I need help on how to extract the contents of some tags, say name, address etc. Here is the code: ....
Extract From String
It's Saturday morning and my mind has gone blank! How do I extract something from a string between two tags? For example, $line = "Hello <1234567> This is test"; I want to get everything in between the '<' and '>' into a variable, so that $var = 1234567; Can someone please point me in the right direction or tell me what to search for?
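A short sketch of one way to do it with preg_match(), assuming only the first <...> pair in the string is wanted. The function name between_angle_brackets is made up for this example.

```php
<?php
// Hypothetical helper: return the text between the first "<" and the next ">",
// or null if the string contains no such pair.
function between_angle_brackets(string $line): ?string
{
    // [^>]* matches everything up to, but not including, the first ">".
    if (preg_match('/<([^>]*)>/', $line, $m) === 1) {
        return $m[1];
    }
    return null;
}

$line = "Hello <1234567> This is test";
$var = between_angle_brackets($line);
echo $var, "\n"; // prints: 1234567
```

The terms to search for are "regular expressions" and "capture groups"; the parentheses in the pattern are what place the inner text into $m[1].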
Extract Function
I am now working on a method to extract the unique record IDs from the search query output. I am doing this because I would like to build a hyperlink to the details page. I am going to write a separate query to pull up the record details. I was wondering if I could pull this information out using the Extract function?