Regular Expression To Extract An HTML Link From A Page
I have a regular expression to extract an HTML link from a page:
href=(["']?)([^>\1]*\.html)\1(?: [^>]*)?>
After "href=" it looks for an optional quote, and then for a run of characters
that are neither that quote nor the closing angle bracket.
The problematic part is [^>\1]*. It should exclude the quote character captured by the first group,
but somehow that doesn't work. Maybe \1 is not allowed inside brackets?
I would like some advice on how to handle this.
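One possible way around this, assuming the pattern is run through PHP's PCRE functions: inside square brackets \1 is not read as a backreference (PCRE treats it as an octal character escape), so the character class cannot exclude "whatever quote was captured". The sketch below sidesteps the backreference with a branch-reset group, giving each quoting style its own alternative. The function name extractHtmlLinks is made up for illustration.

```php
<?php
// Hypothetical helper: returns the .html targets of href attributes.
function extractHtmlLinks(string $html): array
{
    // (?| ... ) is a branch-reset group: each alternative stores its capture in group 1.
    // Alternative 1: double-quoted value, 2: single-quoted value, 3: unquoted value.
    $pattern = '/href=(?|"([^">]*\.html)"|\'([^\'>]*\.html)\'|([^\s>"\']*\.html))(?:\s[^>]*)?>/i';
    preg_match_all($pattern, $html, $matches);
    return $matches[1];
}

$sample = '<a href="page.html">x</a> <a href=\'other.html\' class=c>y</a> <a href=plain.html>z</a>';
print_r(extractHtmlLinks($sample));
```

Each alternative uses a negated character class that names its own quote explicitly, which is what [^>\1] was meant to do.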
Related Forum Messages:
Regular Expression To Get A Page's HTML Into A String
First wanted to say that PHPBuilder seems like an amazing site to me so far. Until a couple weeks ago I had no idea what PHP was. I recently bought a book and have already learned a decent amount, and am excited by how much PHP has to offer. Anyway, here's my question: I want to get a page's HTML into a string, and then parse the string and return the first expression in standard US dollar form... $(one or more digits).(2 digits) How do I go about doing this with, let's say, this page? That is just an example, it doesn't have to be Amazon.
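A minimal sketch of the matching step, assuming the page has already been read into a string (for example with file_get_contents). The helper name firstDollarAmount and the sample markup are invented for illustration.

```php
<?php
// Hypothetical helper: returns the first US-dollar amount such as "$19.99", or null if none is found.
function firstDollarAmount(string $html): ?string
{
    // \$ is a literal dollar sign, \d+ is one or more digits, \.\d{2} is a dot and exactly two digits.
    if (preg_match('/\$\d+\.\d{2}/', $html, $m)) {
        return $m[0];
    }
    return null;
}

// A literal string stands in for the downloaded page so the sketch is self-contained.
$html = '<p>List price: <b>$24.95</b>, now $19.99</p>';
echo firstDollarAmount($html), "\n";
```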
Regular Expression - Get The Text From Html Page
I've been trying unsuccessfully to get the text from an HTML page. The HTML tags that I'm interested in look like this: <a class=link href="http://www.something.com/_something.php?type=cart">Shopping Cart</a> <div><em class=newentry><a href=http://nothing.com>New Age</a></em></div>
Html Page Link Prints Path 3 Times In Header
I have a website with html pages and a php page (a form). I just put in links on the html pages to the php page. When I click them I go to the php page and as bonus the path of the html page is printed at the top three times. (The text in the php page changes depending on which page I have hit the link from.) For example: Quote: http://mysite.com/sitefolder/previouspage.htmlhttp://mysite.com/sitefolder/previouspage.htmlhttp://mysite.com/sitefolder/previouspage.htmlhttp If I highlight and then delete this text in Navigator or Explore both browsers return me to the page written out three times at the top of the page. Here is the code that is giving me this result. Code:
Regular Expression <h1> Extract
I need some help with a regular expression: I want to pull all the <h1> tags out of a page and split each one into its parts. The page has a lot of other HTML code and tags such as <img>, <br>, <p> and so on, but I only want to extract the data in the <h1> tags, for example: <h1><a href="http://yahoo.com/" target="_blank" class="myClass">my keyword</a></h1> I want to receive the URL/link in a $url variable and the anchor text in an $anchor variable.
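A rough sketch of one approach, assuming every heading of interest has the shape <h1><a href="...">text</a></h1> with a double-quoted href. The function name extractH1Links is hypothetical.

```php
<?php
// Hypothetical helper: returns one ['url' => ..., 'anchor' => ...] pair per matching <h1>.
function extractH1Links(string $html): array
{
    // [^"]* keeps the URL inside its quotes; (.*?) is lazy so it stops at the first </a>.
    $pattern = '/<h1[^>]*>\s*<a\s[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>\s*<\/h1>/is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $result = [];
    foreach ($matches as $m) {
        $result[] = ['url' => $m[1], 'anchor' => $m[2]];
    }
    return $result;
}

$html = '<p>intro</p><h1><a href="http://yahoo.com/" target="_blank" class="myClass">my keyword</a></h1><br>';
foreach (extractH1Links($html) as $link) {
    $url = $link['url'];
    $anchor = $link['anchor'];
    echo "$url => $anchor\n";
}
```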
A Web Development QUICK LINK PAGE (QLP) - HTML, Perl, PHP, JavaScript, AJAX, CGI, Etc.
I've recently organized and even color-coded many of my favorite bookmarks on WEB DEVELOPMENT (and a few other favorite subjects too) into what I call QUICK LINK PAGES. These are very condensed, compact (no graphics), fast-loading pages with a 100+ links to some of my favorite web sites on a particular subject. I hope you'll give them a try... Here's the link... The easy-to-remember link above gets you to one of the Quick Link Pages (QLP). The current categories (DIVERSIONS, INVESTING, JAZZ, MACINTOSH, OPERA, PHYSICS (with ASTRONOMY and MATHEMATICS), SPORTS, WEB DEVELOPMENT, and WINDOWS) are color-coded on the top of every page. Just click your favorite category. I hope you'll find a lot to enjoy. If you find any errors, or have suggestions for additional links or categories,
Regular Expression :: Extract Images
/<img\s.*src\s*=\s*"(.*\.(gif|jpeg|jpg))".*>/ works to extract img strings. How would I include all image links and exclude just .gif? I would also like to include image links that may not even have a file extension.
Regular Expression :: Extract Between Brackets
I'm looking for a regular expression but I can't figure out how to do it right. I've got a string like this: '{var1} - {var2} foo bar foo bar {var3} etc.' Now I want to extract all the bits that are within the brackets. With my regular expression ({)(.*)(}) I get everything from the first bracket to the last one, and the brackets in between are not treated as separate matches. I guess I have to tell the expression that no occurrences of '{' are allowed within, but I don't know how to do this.
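The idea of forbidding braces inside the match can be sketched with a negated character class. This is only an illustration; the function name extractBraced is made up.

```php
<?php
// Hypothetical helper: returns the text inside each {...} pair.
function extractBraced(string $text): array
{
    // [^{}]* cannot cross another brace, so every {...} pair is matched on its own.
    preg_match_all('/\{([^{}]*)\}/', $text, $matches);
    return $matches[1];
}

print_r(extractBraced('{var1} - {var2} foo bar foo bar {var3} etc.'));
```

A lazy quantifier, (.*?), would also stop at the first closing brace, but the negated class states the intent more directly.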
Regular Expression - Extract Data Problem
I am trying to extract hello world out of the following html string if the "message" string is found in the same row. I am not sure why it's not working. <?php $string = '<html> <body> <table> <tr> <td class="bg-grey-m"> <table cellpadding="12"> <tr><td>message</td><td><i>hello world</i></td></tr> <tr><td>tel</td><td><i>111-111-555</i></td></tr> </table> </td> </tr> </table> </body>'; preg_match_all("/<td [^>]*class=\"bg-grey-m\">[^<]+<table[^>]*>.*message.*<i>(.*)<\/i>.*<\/table>.*/iU", $string, $matches); foreach($matches[1] as $link) { echo "<li>$link</li>\n"; } ?>
Regular Expression Extract Data Between Tags
preg_match("/<some_tag >([^']*?)<\/some_tag>/", $data, $matches); preg_match_all("/<p>(.*?)<\/p>/", $matches[1], $paragraphs); foreach ($paragraphs[0] as $paragraph) { $content=$content.$paragraph; } The above code only works if <p> is the first tag under <some_tag>, i.e. it works with <some_tag > <p>blah</p> </some_tag> but not with ....
Regular Expression :: Extract The Src Links And Titles Of All The Images
I want to extract the src links and titles of all the images that do have a title in their <img> tag in an HTML page. The problem is that I don't know if the title attribute of the <img> tag is before, after or separated by other attributes from the src attribute (which contains the URL of the image). I want a regex that matches both the title specified by the "title=" attribute and the URL of the image specified by the "src=" attribute of the same <img> element. I want it to match the src and title (in two different parenthesized subsets of the regex) if, and only if, both the title and src attributes are present in the <img> element in no specific order relative to each other. I guess it may require some conditional statement or so.
Extract Mail From A Link
I'm new at this; usually I take scripts that are already made. What I need is a PHP script that will extract an email address from a link without the user seeing the other info or the link itself. For example: I need a box where users type a user ID, and that ID is added to the end of a link, https://www.website.com/userid=USERHERE. The loaded link shows more info, and I need just the email address to be extracted from it and shown to that user.
Using Preg_match_all To Locate HTML Anchor Link BUT Only If The Link Is A .pdf File
The subject line describes what I'm trying to do. I have thought about it for a day or so, tried different things, and searched the board for similar questions, but I still haven't found a proper regular expression. I am trying to use preg_match_all to locate HTML anchor links, BUT only if the link is a .pdf file, within text from a database table. It was actually working just fine until I recently made a change regarding new lines (\n) and (<br />) in the text that is to be searched. Since that change, my previous regexp doesn't work correctly. Here is the regexp I am trying to use: preg_match_all('/<a href="(.*)">(.*)<\/a>/U', $entry->Record['e_entry'], $res_output, PREG_PATTERN_ORDER); That regexp does find each occurrence of an <a href="">something here</a> link, but I need it to find an anchor link only if it is a PDF file link (ex: <a href="../something.pdf">something</a>). So I tried changing the regexp to something like this: preg_match_all('/<a href="(.*)\.pdf">(.*)<\/a>/U', $entry->Record['e_entry'], $res_output, PREG_PATTERN_ORDER); But in the preg_match_all results array ($res_output) a regular <a href="">something here</a> link is ALSO found as well as the .pdf links. I am trying to find only links that contain .pdf at the end of the file name. Sorry that this is written kind of strangely; if you need more info, let me know. Does anyone know what I need to change in the regexp to find ONLY HTML anchor links that contain .pdf at the end of the HREF?
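A likely cause, offered as a guess rather than a confirmed diagnosis: (.*) may match quote characters, so even in ungreedy mode it can run from one link's href across the closing quote and on to a later .pdf" in the text. The sketch below confines the URL to a single attribute value with [^"]*. The function name findPdfLinks is hypothetical, and double-quoted href values are assumed.

```php
<?php
// Hypothetical helper: returns ['href' => ..., 'text' => ...] for anchors whose href ends in .pdf.
function findPdfLinks(string $text): array
{
    // [^"]* cannot leave the quoted attribute value, so the match cannot leak into the next link.
    // The s modifier lets (.*?) span newlines inside the link text.
    $pattern = '/<a\s[^>]*?href="([^"]*\.pdf)"[^>]*>(.*?)<\/a>/is';
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);

    $links = [];
    foreach ($matches as $m) {
        $links[] = ['href' => $m[1], 'text' => $m[2]];
    }
    return $links;
}

$entry = "<a href=\"page.html\">Page</a><br />\n<a href=\"../something.pdf\">Something</a>";
print_r(findPdfLinks($entry));
```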
Regular Expression : {link}
I want to replace: text with {link:pagehref}a link{/link}. with: text with <a href="pagehref">a link</a> I tried several things but nothing seems to work, e.g. $value=preg_replace("/{link:(.+?)}(.+?){\/link}/s","<a href=\"$1\" target=\"_blank\">$2</a>",$value);
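One way this replacement might be written, as a sketch: the braces and the slash are escaped so they are taken literally, and single-quoted PHP strings are used so the HTML quotes need no escaping. The function name expandLinkTags is invented for the example.

```php
<?php
// Hypothetical helper: turns {link:target}label{/link} into an HTML anchor.
function expandLinkTags(string $value): string
{
    // (.+?) is lazy, so each {link:...} pairs with the nearest {/link}.
    $result = preg_replace(
        '/\{link:(.+?)\}(.+?)\{\/link\}/s',
        '<a href="$1" target="_blank">$2</a>',
        $value
    );
    // preg_replace returns null on error; fall back to the untouched input in that case.
    return $result ?? $value;
}

echo expandLinkTags('text with {link:pagehref}a link{/link}.'), "\n";
```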
Regular Expression - Link With Image
I'm trying to retrieve URLs that are directly linked to jpg or jpeg images. Example: <a href='site.com/picture.jpg'><img src='tn_picture.jpg'></a> This is what I have so far: "/(?i)<a([^a]+?)href='([^a]+?)'/i", "/(?i)<a([^a]+?)href=\"([^a]+?)\"/i", "/(?i)<a([^a]+?)href=([^a]+?)[ |>]/i"
Extract Headlines From A HTML File.
I am trying to write a simple web crawler. It has to do the following: 1) Open a URL and retrieve an HTML file. 2) Extract news headlines from the HTML file. 3) Put the headlines into an RSS file. For example, I want to go to this site and extract the headlines: www.unstrung.com/section.asp?section_id=86 The problem is that I do not know how to extract a headline from an HTML file. HTML is not structured like XML, so I do not really know how to solve this problem. I notice that PHP has URL functions to deal with HTML files. For example, there is get_meta_tags() to extract meta tag content attributes from an HTML file. But extracting meta tags is easy. With headlines, I don't really know where they are in an HTML file. Could anyone give me input on this? It is not an impossible problem. If you look at Google News (http://news.google.com/), they crawl the web and sort the headlines on their site.
Extract Records From HTML Of Another Site
First of all let me say I'm new to php. I pieced the following code together from samples I found on the net and a book I bought called PHP Cookbook. So please forgive me if this isn't the best approach - I'm open to suggestions I finally got my code to work that logs into another site and pulls the orderstatus page to my server. <?php /* Login to site */ $ch = curl_init(); curl_setopt($ch, CURLOPT_COOKIEJAR, "/tmp/cookieFileName"); curl_setopt($ch, CURLOPT_URL,"https://www.homier.com/default.asp?page=signin"); curl_setopt($ch, CURLOPT_POST, 1); curl_setopt($ch, CURLOPT_POSTFIELDS, "EMail=homierorders@swbell.net&Password=1040ez"); ob_start(); // prevent any output curl_exec ($ch); // execute the curl command ob_end_clean(); // stop preventing output curl_close ($ch); unset($ch); /* Dump html of orderstatus page into a file on my server */ $fh = fopen('raw_orderstatus.html','w') or die($php_errormsg); $ch = curl_init(); curl_setopt($ch, CURLOPT_RETURNTRANSFER,1); curl_setopt($ch, CURLOPT_COOKIEFILE, "/tmp/cookieFileName"); curl_setopt($ch, CURLOPT_URL,"https://www.homier.com/default.asp?page=orderstatus"); curl_setopt($ch, CURLOPT_FILE, $fh); curl_exec ($ch); curl_close ($ch); ?> My problem: How can I capture only the data in the "<td class='n8n_CCCCCC_default>" tags? Is there a way to do this at file creation? I checked with my ISP and I can't use LYNX -DUMP file.html The goal here is to load these records into MYSQL database.
PHP4 : Extract Text From HTML File
I would like to extract the text in an HTML file. For the moment, I'm trying to get all text between <td> and </td>. I used a regular expression because I don't know the format between <td> and </td>. It can be: <td>text1 </td> or <td> text1 </td> or anything else. eregi("<td(.*)>(.*)(</td>?)",$text,$regtext); The problem is that, if I have <td>text</td> <td>text2</td>, regtext will return text</td><td>text2. How can I change the expression so that it stops at the first occurrence of </td>?
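A sketch of the usual fix, swapping in the preg functions because the ereg family has no lazy quantifiers (and was removed in PHP 7): a lazy (.*?) stops at the first </td> instead of the last. The function name extractCells is hypothetical.

```php
<?php
// Hypothetical helper: returns the contents of every <td>...</td> pair.
function extractCells(string $text): array
{
    // [^>]* swallows any attributes on the opening tag; (.*?) stops at the first </td>.
    // i = case-insensitive, s = let the dot match newlines inside a cell.
    preg_match_all('/<td\b[^>]*>(.*?)<\/td>/is', $text, $matches);
    return $matches[1];
}

print_r(extractCells('<td>text</td> <td class="x">text2</td>'));
```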
Using PHP To Parse Html Tables And Extract Values
I've been presented with a task of parsing multiple .jsp's (this is after they have been executed server side so I guess for all purposes its actually a html file). Anyway each of these pages have large complex tables displaying a lot of reporting data for one of our systems. My original method of carrying out this task was to go into the code and get the actual DB querys that the page executes and have this more as a bash based solution. However after spending several days trying to hack my way through a jungle of 100's of querys which dont hold to any naming convention Im going to plan B. So here's what Im looking to do. Get php to construct the correct url for the jsp. What I mean by construct is to make the url while dynamically inserting the correct values into the url as it uses GET to set the date range of the information it writes to the browser. Once its done that and requested the page is processed I want php to search through the page and find the results that Im looking for, assign them to variables and finally format the information from all the different jsp's into one php page. One nice thing is that I'm able to modify the .jsp's to wrap a comment around the data I want for example. I think this should remove the hardest part of the job which is having php identify what values I actually want. #take_this_value# 1234556 ####### What I dont know is how to get PHP to request the url I create,parse it and extract the values. I'm guessing this is a job for wget and regular expressions but Im not too sure where to start (or if there is more appropriate functions to use).
Extract Html Table Cells And Put To An Array
I have a table like: <tr><td>headA</td><td>headB</td><td>headC</td><td>headD</td></tr> <tr><td>1a</td><td>1b</td><td>1c</td><td>1d</td></tr> <tr><td>2a</td><td>2b</td><td>2c</td><td>2d</td></tr> <tr><td>3a</td><td>3b</td><td>3c</td><td>3d</td></tr> <tr><td>4a</td><td>4b</td><td>4c</td><td>4d</td></tr> where there can be any number of rows and any number of columns. How can I read through this and create an array for each row, using the header row as the keys? I.e. have something like: Quote: myArray[0] = array( 'headA' => '1a', 'headB' => '1b', 'headC' => '1c', 'headD' => '1d', ); myArray[1] = array( 'headA' => '2a', 'headB' => '2b', 'headC' => '2c', 'headD' => '2d', ); etc....
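One possible approach, sketched with PHP's DOM extension rather than a regular expression (this assumes ext-dom is installed). The function name tableRowsToArrays is made up; the first row is taken as the header, and rows whose cell count differs from the header are skipped.

```php
<?php
// Hypothetical helper: converts bare <tr> rows into arrays keyed by the first (header) row.
function tableRowsToArrays(string $rowsHtml): array
{
    $doc = new DOMDocument();
    // Wrap the rows in <table> so the parser sees a complete table;
    // @ silences libxml warnings about sloppy markup.
    @$doc->loadHTML('<table>' . $rowsHtml . '</table>');

    // Collect the cell texts of every row.
    $rows = [];
    foreach ($doc->getElementsByTagName('tr') as $tr) {
        $cells = [];
        foreach ($tr->getElementsByTagName('td') as $td) {
            $cells[] = trim($td->textContent);
        }
        $rows[] = $cells;
    }

    $header = array_shift($rows); // first row supplies the keys (null if there are no rows)
    $result = [];
    foreach ($rows as $cells) {
        // array_combine needs equally long arrays, so mismatched rows are skipped.
        if ($header !== null && count($cells) === count($header)) {
            $result[] = array_combine($header, $cells);
        }
    }
    return $result;
}

$html = '<tr><td>headA</td><td>headB</td></tr><tr><td>1a</td><td>1b</td></tr><tr><td>2a</td><td>2b</td></tr>';
print_r(tableRowsToArrays($html));
```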
Extract Data From A Web Page?
with PHP, can somebody direct me or give me some insight on how to go about extracting data from a web page. Say I want to pull sports statistics from a page.. how is this done?
Extract Certain Data From Page?
For example, if a source had links that were all like http://server.com/dir/file.ext?id=1234567 and I just wanted to extract all of the numbers after the id=, how would I do it? There would be a lot of different links with different IDs and I'd like to extract all of them.
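A minimal sketch, assuming the IDs are purely numeric and the parameter is literally named id. The function name extractIds is hypothetical.

```php
<?php
// Hypothetical helper: returns every numeric value that follows ?id= or &id= in the source.
function extractIds(string $source): array
{
    // [?&] makes sure "id" is a whole parameter name, not the tail of one such as "userid".
    preg_match_all('/[?&]id=(\d+)/', $source, $matches);
    return $matches[1];
}

$source = '<a href="http://server.com/dir/file.ext?id=1234567">a</a> <a href="http://server.com/dir/file.ext?id=42">b</a>';
print_r(extractIds($source));
```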
Extract URLs From A Web Page
I am setting up a trusted feed for Ink and the third-party provider seems to be having a bit of a problem getting the full list of my URLs extracted. I have a complete site map for this site, which is broken down into 13 pages, each with 100 or fewer URLs. I would like to use some sort of script to pull the URLs and put them in a big list in Excel or Notepad. Is there an easy way to do this? I posted this in PHP because I understand it well; if there is another language that does it better.
Extract Only Plain Text From A Page
Basically, what I am trying to do is write some PHP code that will automatically take text from any web page and eliminate all the HTML, CSS, and JS codes and formatting, leaving only the plain text from the page. I got my code started, but I have hit a snag with javascript and css codes. This is what I have so far: <?php $geturl = $_GET["url"]; ob_start(); include($geturl); $page = ob_get_contents(); ob_end_clean(); $output = ereg_replace('<script.*.</script>', ' ', $page); $output2 = ereg_replace('<style.*.</style>', ' ', $output); $plaintext = strip_tags($output2); echo $plaintext; ?> The strip_tags function automatically removes all html tags, but it doesn't do anything to javascript and css because html code is not provided between the beginning and end tags, whereas javascript and css codes are both contained within two separate tags, like this for more clarification: html: <div name="htmltag">Keep this text here</div> javascript: <script>function somejs() {remove all this code}</script> As you can see, the text between the div tags should stay, but the js between the script tags should be removed because it is code. I then tried the ereg_replace function to get rid of js and css codes, but there is a problem when there is more than 1 piece of js or css code. The wildcard value (.*.) skips over any ending script or style tags until it reaches the last ending tag, therefore deleting all the text between the two pieces of code. Example: <SCRIPT>function somejs() {remove all this code}</script> //removes all text and code from beginning here KEEP ALL THIS TEXT HERE <script>function somejs() {remove all this code}</SCRIPT> //to end here Now finally down to the question, is there any way to only remove the js and css code between the beginning tag and the immediate next ending tag? Or is there any other way to get rid of the javascript and css codes?
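The greedy wildcard described above can be replaced by a lazy one, which stops at the first closing tag. The sketch below uses preg_replace instead of ereg_replace, since the ereg functions do not support lazy quantifiers. The function name htmlToPlainText is an assumption made for the example.

```php
<?php
// Hypothetical helper: removes <script> and <style> blocks, then strips the remaining tags.
function htmlToPlainText(string $page): string
{
    foreach (['script', 'style'] as $tag) {
        // .*? is lazy, so each block ends at the nearest closing tag.
        // i = match <SCRIPT> and <script> alike, s = let the dot cross newlines.
        $pattern = '#<' . $tag . '\b[^>]*>.*?</' . $tag . '\s*>#is';
        $page = preg_replace($pattern, ' ', $page) ?? $page;
    }
    // strip_tags removes the ordinary tags; the last step collapses runs of whitespace.
    return trim(preg_replace('/\s+/', ' ', strip_tags($page)));
}

$page = '<SCRIPT>function somejs() {remove all this code}</script> KEEP ALL THIS TEXT HERE '
      . '<script>function somejs() {remove all this code}</SCRIPT>';
echo htmlToPlainText($page), "\n";
```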
Extract Specific Parts Of A Page
I have a JS script that runs from a remote location and displays the latest lottery results, but there are ads on it. I would like to extract the GIFs/JPEGs from it. Can it be done, and if so, how?
Extract Each Part Of A Page That Are Enclosed In A <b>-tag
I would like to extract each part of a page that is enclosed in a <b>-tag, but ONLY if there are no other tags enclosed. <b>some text</b> <-- a match <b>some text <i>in italic </i></b> <-- should NOT match This is what I have so far: $pattern = "/<b>(.*?)<\/b>/si"; preg_match_all($pattern, "<b>some text <i>in italic </i></b>", $out); Any thoughts?
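One thought, sketched below: replacing (.*?) with [^<]* forbids any "<" between the opening and closing tag, so a bold run that contains another tag can no longer match. The function name extractPlainBold is invented for illustration.

```php
<?php
// Hypothetical helper: returns the contents of <b>...</b> pairs that contain no nested tags.
function extractPlainBold(string $html): array
{
    // [^<]* stops at the first "<", which must then be the start of </b>.
    preg_match_all('/<b>([^<]*)<\/b>/i', $html, $out);
    return $out[1];
}

print_r(extractPlainBold('<b>some text</b> <b>some text <i>in italic </i></b>'));
```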
HTML Tags Regular Expression
preg_replace('/test/', 'replacedtext', $text),1) I need the line above to replace all instances of the text, except those that are inside HTML tags. How would I change it so that matches within tags are excluded?
Extract The TITLE Section Of A Web Page Using Preg_replace()
I went through some examples and tried a bunch of things, but I still can't figure out why I can't extract the TITLE section of a web page using preg_replace(): <?php $response = file_get_contents($url); $output=preg_replace("|<title>(.+?)</title>|smiU", "TITLE=$1", $response); $fp = fopen ("output.html", "w"); fputs ($fp,$output); fclose($fp);
Regular Expression Strip HTML Tags
I'm stripping out the attributes in <TD> tags...but I want to strip out everything BUT the COLSPAN attribute. The following strips out all attributes. What do I do if I want to keep a certain one? eregi_replace("<TD[^>]*>","<TD>", $string);
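A sketch of one way to keep a single attribute, using preg_replace_callback so each opening tag can be inspected before it is rewritten. The function name stripTdAttributesExceptColspan is hypothetical, and numeric COLSPAN values are assumed.

```php
<?php
// Hypothetical helper: rewrites every opening <TD ...> tag, keeping only a numeric COLSPAN.
function stripTdAttributesExceptColspan(string $html): string
{
    $result = preg_replace_callback(
        '/<td\b([^>]*)>/i',
        function (array $m): string {
            // Accept colspan=3, colspan="3" and colspan='3' anywhere among the attributes.
            if (preg_match('/\bcolspan\s*=\s*["\']?(\d+)["\']?/i', $m[1], $c)) {
                return '<TD COLSPAN="' . $c[1] . '">';
            }
            return '<TD>';
        },
        $html
    );
    // preg_replace_callback returns null on error; fall back to the original markup.
    return $result ?? $html;
}

echo stripTdAttributesExceptColspan('<TD class="x" COLSPAN=2 width="10">a</TD><td align=left>b</td>'), "\n";
```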
Regular Expression :: HTML Into XHTML Code
To convert HTML into XHTML code, e.g. make <br> into <br />. Code: $text = preg_replace('/(<img .*)("|\'| )>/i','\1\2 />',$text); The problem: it removes the / from the source. E.g. <img src="../myimages/blah.gif"> turns into: <img src="..myimages/blah.gif" />
Regular Expression :: Convert HTML Into XHTML
I've been fiddling with it for ages now. To convert HTML into XHTML code, e.g. make <br> into <br />. Code: $text = preg_replace('/(<img .*)("|\'| )>/i','\1\2 />',$text); The problem: it replaces some other tags too. For example I have: <a href.....> but it changes it to <a href...... />
Regular Expression - Html Markups With Quotes, Parenthesis
Some image markup has quotes around the src value and some does not. I can pull the image only if there are quotes; the rest doesn't get picked up. I.e.: <img src='image.jpg'> and <img src=image.jpg> Here's what I have: preg_match_all('/<img[^>]+src\s*=\s*(["\']?)?([^>\s]+)\1[^>]*>/i', $img, $pic); $img = $pic[0][0]; $sery = preg_split('/src="/', $img); $sery = preg_split('/\.(jpg|jpeg|png|gif)"/', $sery[sizeof($sery)-1]); $img = $sery[0].".jpg";
Setup A Link In My Page That Will Change My Page
I am working on a page with a right column that I want to use for navigation. In this right column I am using the below code to set a value for the link. I am using the variable $test right now. I want to click on the link and when the value is set to a certain value, say 1, I want the script to run and load a page based on a switch case. This way I can use different links for navigation that will load different forms and areas of my application. So here is the code for the link line. Code:
Regular Expression To Remove All The HTML Elements And Only Leave The Plain Text.
I am trying to take some HTML and remove all the HTML elements and only leave the plain text. Basically I am trying to extract information that was put into an HTML table. I have a regular expression which catches the HTML elements, it is <.*?> but I actually want the inverse of this, that regular expression returns the HTML elements to me, I want the plain text. I tried doing [^<.*?>]+ but square brackets will only work with 1 character at a time, so it is not seeing <.*?> as a whole.
Regular Expression Detecting Entire Page
I am making a flat file, static HTML search engine for a site. I downloaded a script from the net and have been working with it for my needs. Everything was working OK with a few test Lorem Ipsum pages. But the moment I try to search real data, fit hits the shan. The script uses a regular expression to search through the files. And for most of these files, the regular expression doesn't seem to pick up the matches as it should, and for lack of a match, it outputs the entire html page as a hit. The search word is on the page somewhere, but it still outputs the entire page. Allow me --------8<---------------------------- if(preg_match_all("/((\s\S*){0,3})($keyword) ((\s?\S*){0,3})/i", $portion, $match, PREG_SET_ORDER)); { if(!$limit_extracts) $number=count($match); else $number=$limit_extracts; for ($h=0;$h<$number;$h++){ // no limit if (!empty($match[$h][3])) $text = sprintf("... %s<font class='keyword'>%s</font>%s ...", $match[$h][1], $match[$h][3], $match[$h][4]); else{ //print_r($match); } } } --------8<---------------------------- There's the regex that looks through $portion, which is the strip_tags version of the file's contents. And if I echo $portion right before that line, I see the stripped code. However, when I get to the line where it checks $match[$h][3] for the keyword that was searched for, it craps out. Not 100% of the time, but most of the time. I'm trying to figure out details about these html pages, to no avail. So as a result of failing that empty test, the entire html page is dumped out as a search result. Not being a regex expert, I've had a hell of a time troubleshooting. But I feel the problem lies in there. Something with the regex not finding the keyword correctly, or something.
HTML Link
I want to create a link in the outgoing mail generated by a PHP page. The select options include each of the products, which are coded by an ID number. I have a variable, $productid, coming from a select option. It is written as <OPTION VALUE="23532"> What I need is for this to provide a complete link and the name of the product. I would like to query the store's database so that, if the ID number is provided, the resulting PHP page that sends the mail and displays the results on screen reads from the database to complete the link that goes onto the screen and into the mail, thus pulling $product_name, $product_description and $product_thumbnail and composing them into a preformatted HTML email message. <P ALIGN="CENTER"><IMG SRC="$imagelink" width="100" height="120"><BR> <A HREF="http://www.foobar.com/products.php?productid="$productid"> $product name</A> $product_description</P>
Url Html Link
Rather than make my own function, I'm just wondering if there is a predefined one, or one you guys know of. I want something that will take a link someone has entered (www.website.com) and wrap it in HTML to make it a link, like so: <a href="http://www.website.com" target="blank">www.website.com</a>
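As far as I know, PHP has no single built-in function that does this, so a small helper is probably needed. The sketch below is one hedged possibility: linkify is a made-up name, the input is assumed to be a bare host or URL, and target="_blank" (with the underscore) is used instead of the "blank" shown above.

```php
<?php
// Hypothetical helper: wraps a user-entered address in an anchor tag.
function linkify(string $url): string
{
    // Prepend a scheme only when the user did not type one.
    $href = preg_match('#^https?://#i', $url) ? $url : 'http://' . $url;
    // htmlspecialchars guards against quotes or angle brackets in the user input.
    return '<a href="' . htmlspecialchars($href, ENT_QUOTES) . '" target="_blank">'
        . htmlspecialchars($url, ENT_QUOTES) . '</a>';
}

echo linkify('www.website.com'), "\n";
```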