29 June 2008

Notepad++: A guide to using regular expressions and extended search mode

The information in this post details how to clean up DMDX .zil files, allowing for easy importing into Excel. However, the explanations following each Find/Replace term will benefit anyone looking to understand how to use Notepad++ extended search mode and regular expressions.

If you are specifically looking for multiline regular expressions, look at this post.

You may already know that I am a big fan of Notepad++. Apparently, a lot of other people are interested in Notepad++ too. My introductory post on Notepad++ is the most popular post on my speechblog. I have a feeling that that is about to change.

Since the release of version 4.9, the Notepad++ Find and Replace commands have been updated. There is now a new Extended search mode that allows you to search for tabs (\t), newlines (\r\n), and a character by its value (\o, \x, \b, \d, \t, \n, \r and \\). Unfortunately, the Notepad++ documentation is lacking in its description of these new capabilities. I found Anjesh Tuladhar's excellent slides on regular expressions in Notepad++ useful. After six hours of trial and error, I managed to bend Notepad++ to my will. And so I decided to post what I think is the most detailed step-by-step guide to Search and Replace in Notepad++, and certainly the most detailed guide to cleaning up DMDX .zil output files on the internet.

What's so good about Extended search mode?

One of the major disadvantages of using regular expressions in Notepad++ was that it did not handle the newline character well—especially in Replace. Now, we can use Extended search mode to make up for this shortcoming. Together, Extended and Regular Expression search modes give you the power to search, replace and reorder your text in ways that were not previously possible in Notepad++.

Search modes in the Find/Replace interface

In the Find (Ctrl+F) and Replace (Ctrl+H) dialogs, the three available search modes are specified in the bottom right corner. To use a search mode, click on the radio button before clicking the Find Next or Replace buttons.

Cleaning up a DMDX .zil file

DMDX allows you to run experiments where the user responds by using the mouse or some other input device. Depending on the number of choices/responses (and of course the kind of task), DMDX will output a .zil file containing the results (instead of the traditional .azk file). This is specified in the header along with the various response options available to the participant. For some reason, DMDX outputs the reaction time twice—and on separate lines—in .zil files. Here's a guide for cleaning up these messy .zil files with Notepad++. Explanations of the Notepad++ search terms are provided in bullet points at the end of each step.

Step 1: Back up your original result file (e.g. yourexperiment.zil) and create a copy of that file (yourexperiment_copy.zil) that we will edit and clean up.

Step 2: Open yourexperiment_copy.zil in Notepad++ (version 4.9 or later).



Step 3: Remove all error messages. All lines containing DMDX error messages begin with an exclamation mark. Let's get rid of them.

Bring up the Replace dialog box (Ctrl+H) and select the Regular Expression search mode.

Find what: [!].*

Replace with: (leave this blank)

Press Replace All. All the error messages are gone.


  • [!] finds the exclamation character.

  • .* selects the rest of the line.
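
For anyone who would rather script this step, the same idea can be tried outside Notepad++. This is only a minimal Python sketch: the sample text is made up for illustration and is not real DMDX output.

```python
import re

# Hypothetical sample text; real .zil content will look different.
sample = "Item 1, 512.30\r\n!an error message\r\nItem 2, 498.10\r\n"

# [!] matches a literal exclamation mark.
# [^\r\n]* plays the role of .* here: it takes the rest of the line
# without swallowing the Windows line ending.
cleaned = re.sub(r"[!][^\r\n]*", "", sample)

print(repr(cleaned))
```

The emptied line is still there as a blank line, which is what Step 4 deals with.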

Step 4: Get rid of all these blank lines.

Switch to Extended search mode in the Replace dialog.

Find what: \r\n\r\n

Replace with: (leave this blank)

Press Replace All. All the blank lines are gone.



  • \r\n is a newline in Windows (a carriage return followed by a line feed).

  • \r\n\r\n finds two newline characters (what you get from pressing Enter twice).
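
A small Python sketch of the same replacement, using a made-up two-line sample, shows what actually happens to the surrounding text. It is an illustration only, not part of the Notepad++ procedure.

```python
# Hypothetical sample: a blank line sits between two lines of text.
sample = "first line\r\n\r\nsecond line\r\n"

# Replacing the pair of Windows newlines with nothing removes the blank line,
# but it also joins the text on either side of it.
joined = sample.replace("\r\n\r\n", "")

print(repr(joined))
```

Because the neighbouring lines end up joined, the next step has to split the text again at each "Item".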


Step 5: Put each Item (DMDXspeak for trial) on a new line.

Switch to Regular Expression search mode.

Find what: (\+.*)(Item)

Replace with: \1\r\n\2

Press Replace All. "Item"s have been placed on new lines.



  • \+ finds the + character.

  • .* selects the text after the + up until the word "Item".

  • Item finds the string "Item".

  • () allow us to access whatever is inside the parentheses. The first set of parentheses may be accessed with \1 and the second set with \2.

  • \1\r\n\2 will take + and whatever text comes after it, will then add a new line, and place the string "Item" on the new line.
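
The grouping and recall can be tried in Python as well. This is a rough sketch on an invented line, assuming Python's `re` module behaves closely enough to Notepad++ for this pattern.

```python
import re

# Hypothetical line in which a second trial has run on from the first.
sample = "+ 512.30Item 2, + 498.10"

# Group 1 is the + and everything after it up to "Item"; group 2 is "Item".
# The replacement writes group 1, a Windows newline, then group 2.
split = re.sub(r"(\+.*)(Item)", r"\1\r\n\2", sample)

print(repr(split))
```

In Python at least, .* is greedy, so when one line holds several "Item"s a single pass only splits at the last one.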

So far so good. Our aim now is to delete duplicate or redundant information (reaction time data).


Step 6: Remove all newline characters using Extended search mode, replacing them with a unique string of text that we will use as a signpost for redundant data later in RegEx. Choose a string of text that does not appear in your .zil file; I have chosen mork.

Switch to Extended search mode in the Replace dialog.

Find what: \r\n

Replace with: mork

Press Replace All. All the newline characters are gone. Your entire DMDX .zil file is now one very long line of (in my case word-wrapped) text.



Step 7: We're nearly there. Using our mork signpost keyword, let's separate the different RT values.

Stay in Extended search mode.

Find what: ,

Replace with: ,mork

Press Replace All. Now, mork appears after every comma.


Step 8: Let's put the remaining Items on new lines.

Switch to and stay in Regular Expression search mode for the remaining steps.

Find what: mork(Item)

Replace with: \r\n\1

Press Replace All. All "Item"s should now be on new lines.



Step 9: Let's get rid of those duplicate RTs.

Find what: mork ([^A-Za-z]*)mork [^A-Za-z]*\,mork

Replace with: \1,

Press Replace All. Duplicate reaction times are gone. It's starting to look like a result file :)



  • A-Z finds all letters of the alphabet in upper case.

  • a-z finds all lower case letters.

  • A-Za-z will find all alphabetic characters.

  • [^...] is the inverse. So, if we put these three together: [^A-Za-z] finds any character except an alphabetic character.

  • Notice that only one of the [^A-Za-z] is in parentheses (). This is recalled by \1 in the Replace with field. The characters outside of the parentheses are discarded.
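
Here is a hedged Python sketch of the same search and replace. The sample string is a guess at what the text might look like after Steps 6 and 7 (the same reaction time twice, with mork standing in for old line breaks and following each comma); real files will differ.

```python
import re

# Hypothetical fragment: each reaction time appears twice.
sample = "Item 1,mork 512.30mork 512.30,morkItem 2,mork 498.10mork 498.10,mork"

# Only the first [^A-Za-z]* is in parentheses, so only the first copy of the
# reaction time is recalled by \1. The second copy and the morks are dropped.
deduped = re.sub(r"mork ([^A-Za-z]*)mork [^A-Za-z]*,mork", r"\1,", sample)

print(deduped)
```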

Step 10: Let's get rid of all those morks.

Find what: mork

Replace with: (leave blank)

Press Replace All. The morks are gone.



Step 11: Separate each participant's data from the next.

Find what: (\**\*)

Replace with: \r\n\r\n\1\r\n\r\n

Press Replace All. The final product is a beautiful, comma-delimited .zil result file that is ready to be imported into Excel for further analysis.
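
The Step 11 pattern matches a whole run of asterisks (zero or more, followed by one more). A minimal Python sketch on an invented separator line, for illustration only:

```python
import re

# Hypothetical text: a row of asterisks between two participants' data.
sample = "data for subject 1**********data for subject 2"

# (\**\*) captures the full run of asterisks; \1 writes it back,
# padded with blank lines before and after.
separated = re.sub(r"(\**\*)", r"\r\n\r\n\1\r\n\r\n", sample)

print(separated)
```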



Notepad++, is there anything it can't do?


Please post your questions in the comments below, rather than emailing me. This way, others can refer to my answers here, saving me many hours of responding to similar emails over and over.

Update 20/2/2009: Having trouble understanding regexp? I have created a new Guide for regular expressions. Check it out.

402 comments:

James said...

Hi, can those steps be automated in notepad++ ? like actions in photoshop?

Mark said...

James, that is the million dollar question. I immediately tried to automate this somehow but could not get Notepad++ to save these steps in a macro. If I find a solution, I will post it.

ninj said...

Nice article!

However, the reason why I arrived on your blog still remains unanswered:

How to replace a multiple line regexp by a simple value (in my case: nothing).

Here is the case:
In Symfony YAML generated files, I have the created_at and updated_at fields dumped, which I don't want.
I need to replace something like this:
/ *created_at:.*\n *updated_at:.*\n/
by
//
The way to do it is important because I want the blank lines to disappear as well.

Of course I know it is possible to do it in two or three steps, but I'd like to find how to achieve it in one only, I'm a regexp maniac ;)

Maybe you or someone else has a solution... I couldn't manage to get one through either the CTRL-H or the CTRL-R dialog.

Thanks!

Mark said...

ninj, currently you cannot do this in Notepad++. This is because replacing newlines is possible in Extended search mode, and regular expressions are available in Regexp search mode. You are trying to combine the two search modes, and in the current version of Notepad++ you cannot.

Since I wrote this post, I too have caught regexp mania. If you are serious about using regular expressions for more advanced search and replace (as you are) then you need to use a more powerful text editor. I recommend XEmacs—I've been using it for about a month, and it is very powerful. I'm working on a post for XEmacs right now.

As for your specific problem, it is possible to get rid of the created_at and updated_at information. I would need to see the text file (feel free to send a sample to me as an email attachment). I have made a few assumptions: 1. that created_at and updated_at always occur on consecutive lines, 2. that there is information above and below these lines that is useful. The XEmacs regular expression would be this:

Search for:
\(.*\) newline
.*created_at:.* newline
.*updated_at:.* newline
\(.*\)

Replace with:
\1 newline
\2


Note: In XEmacs, the newline character is created by pressing Ctrl+Q Ctrl+J.
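
For readers with a scripting language to hand, a multi-line pattern like the one in the question can also be run in one pass there. The following is a minimal Python sketch; the YAML fragment is invented, and only the created_at and updated_at field names come from the question.

```python
import re

# Hypothetical YAML fragment.
yaml_text = (
    "Article_1:\n"
    "  title: Hello\n"
    "  created_at: 2008-06-29 10:00:00\n"
    "  updated_at: 2008-06-29 10:00:00\n"
    "Article_2:\n"
    "  title: World\n"
    "  created_at: 2008-06-30 09:00:00\n"
    "  updated_at: 2008-06-30 09:30:00\n"
)

# re.MULTILINE lets ^ match at the start of every line. "." does not match
# "\n" by default, so each ".*" stays on its own line.
pattern = re.compile(r"^ *created_at:.*\n *updated_at:.*\n", re.MULTILINE)
stripped = pattern.sub("", yaml_text)

print(stripped)
```

Both lines and their line breaks are removed together, so no blank lines are left behind.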

Anonymous said...

Quick bleg. I would like to replace all occurrences of number+comma with number + TAB. So 12.8, 100 would become 12.8 TAB 100.

I'm using "\d," for the [Find What] value and "\1\t" for the [Replace With] value.

Unfortunately I lose that last digit in the number that I'm replacing.

Any help would be appreciated.

Anonymous said...

Ok, I actually figured it out.

The [Find What] value should be "(\d)," and the [Replace With] value
should be "\1\t". In other words I just needed the parentheses around "\d" criteria.

Thanks for the useful article Mark.
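
The same fix, sketched in Python for comparison (the input is the example from the question):

```python
import re

sample = "12.8, 100"

# (\d) captures the digit before the comma so that \1 can write it back.
# Without the parentheses there is no group 1 and the digit is lost.
tabbed = re.sub(r"(\d),", r"\1\t", sample)

print(repr(tabbed))
```

Note that the space after the comma is left in place; add it to the pattern if it should go too.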

Flick said...

Thank you for the guide! I have to admit it's a little advanced for me, and I've only just found out about REGEX expressions, but am still very excited nonetheless!

I'm a little confused by what to do in my situation. I have a mySQL file that I'd like to run, and the first part of each line is something like this:

INSERT INTO my_table (id,uid,my_msg,my_date,the_ip) VALUES ('2',

I would very much like to be able to change the '2' part to just NULL and REGEX seems to be the way forward. However, I think I'd have to use ( as a unique identifier, and given that REGEx uses brackets as the separators, I'm now a little stuck. Apologies in advance for this simple question, but my brain is really not working today.

Thanks!

p/s: I'll continue looking into it in the meantime.

Flick said...

Just a quick update: I've been able to use Column Mode select (Alt+mouse) to select the column and replace the NULL, since thankfully everything is in the same column!

I wonder if it is still possible in Regex though?

Thanks :)

Mark said...

Hi Flick, thanks for your comments. I do have a regex solution for you that is very easy and quick. Note that this regex syntax is specific to Notepad++.

First, let me answer your question re: the curved bracket (or parenthesis) character: in order to search for and find the open parenthesis character, place the parenthesis within square brackets like this: [(]

However, you do not need to use the parentheses or square brackets at all to achieve what you want to (if I have understood you correctly).
Search for: '.*',
Replace with: NULL

If you do not want to get rid of the comma, then delete it from the search term. If this then stuffs up your search and finds incorrect portions of text, you could insert a comma after null in the replace with expression: NULL,
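
One thing to watch with '.*', is greediness when a line holds several quoted values. The Python sketch below illustrates this; the start of the line is from the question, but the remaining values are invented, and Notepad++ may not behave identically.

```python
import re

# Only the part up to ('2', comes from the question; the rest is hypothetical.
line = ("INSERT INTO my_table (id,uid,my_msg,my_date,the_ip) "
        "VALUES ('2','7','hello','2008-06-29','127.0.0.1');")

# Greedy: .* runs on to the LAST quote-comma pair on the line.
greedy = re.sub(r"'.*',", "NULL", line)

# [^']* cannot cross a quote, so only the first value is replaced.
# \( matches a literal open parenthesis, just as [(] does.
careful = re.sub(r"VALUES \('[^']*',", "VALUES (NULL,", line, count=1)

print(greedy)
print(careful)
```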

Anonymous said...

Mark,
Do you have some advice for the following. I have a set of text lines... and I want to delete duplicate lines. But the redundant information will occur only at the beginning of the line, the end of those lines differ in their information. I'm just starting to use notepad++ RegExp utilities, but I'm no whiz yet with the format.
Thanks

Mark said...

That's exactly what regular expressions do. Give me 4-5 lines of your text as an example, and I'll show you the correct regexp.

Anonymous said...

ok... I've made the text file simpler so that the duplicates I want to delete all have the same information.

[19-766]
???^Los Angeles^60-638^LOS ANGELES CITY USE ONE STEP 1940 LARGE CITY ED FINDER
[19-767]
???^Los Angeles^60-638^LOS ANGELES CITY USE ONE STEP 1940 LARGE CITY ED FINDER
[19-773]
???^Los Angeles^60-638^LOS ANGELES CITY USE ONE STEP 1940 LARGE CITY ED FINDER
[19-1581]
???^Los Angeles^60-638^LOS ANGELES CITY USE ONE STEP 1940 LARGE CITY ED FINDER

the phrases in brackets, on separate lines, are ignored by the final use of the text file. They can remain, but I do want to delete the duplicates of the ??? lines. I'll have other cities with similar format.

thanks

Anonymous said...

... this group of lines is followed, for example, by:

[19-773]
???^Los Angeles^60-639^LOS ANGELES CITY USE ONE STEP 1940 LARGE CITY ED FINDER
[19-1580]
???^Los Angeles^60-639^LOS ANGELES CITY USE ONE STEP 1940 LARGE CITY ED FINDER


so the number between the second and 3rd ^s will change throughout the file, as will the county name between the 1st and 2nd ^

Mark said...

Ok, I understand the problem. Can you provide me with what you would like the output to look like after applying the regexp.

For e.g., should it look like this:

Los Angeles 60-638
Los Angeles 60-639

Is this the only useful info? Should everything else be deleted?

Also, are the number of repetitions (lines of redundant info) the same for each city/number?

Anonymous said...

Mark,

As further background, you are looking at the content of the 1930 census districts laundered into the 1940 census districts. I have transcribed a cross table between 1930 and 1940, and we seeded the 1940 EDs with the 1930 information. Those 1930 ED numbers are in brackets, and point to the next text line (where that information came from). Since census districts change boundaries between federal censuses, especially in large cities, you will see multiple 1940 entries from different 1930 EDs that are partially contained within the 1940 ED. I don't think there would be any more than 10 such contribution EDs. For rural areas the data from 1930 to 1940 is accurate, for urban areas we have transcribed street indexes for over 200 large cities, thus instead of repeating their 1940 ED streets (I have scanned 28 rolls of 1940 ED descriptions), I just direct them to the other utility. For smaller areas of 25,000 or more, I intend to get street indexes for them, and have replaced their descriptions with "TO BE DONE BY BOUNDARY OR STREET INDEX".

When there are multiple ED entries for a single 1940 ED # (which is a two part number), they will occur together as a block with no blank line between the various lines. If a 1940 ED has only a single 1930 entry, it should have a blank line above the brackets, and one below the text line.

I fooled with TextFX but it moves the brackets from the text lines, doesn't show a numerical sort of numbers (thus one sees 2, 20, 21, ...) and for some didn't get me to a unique line.

I need the entire line. So for the first example I want:

[19-766]
???^Los Angeles^60-638^LOS ANGELES CITY USE ONE STEP 1940 LARGE CITY ED FINDER
[19-767]
[19-773]
[19-1581]

but I'm willing to give up the brackets lines, but I do want a blank line between the statements.

I've done 2 states, and with California decided to do some more automation. To see Alabama and Arkansas... go to http://www.stevemorse.org/ed/ed.php
and choose 1940 and one of those two states.

Thanks... I'll ask Steve Morse to acknowledge you on the One Step site if you can pull this off.

Joel Weintraub
Dana Point, CA

Anonymous said...

Mark,

Steve Morse wrote a utility to do what I want.

But it was interesting to see if RegEx could do the same thing.

So... thanks for your help... don't do any more.

Thanks

Joel Weintraub

Mark said...

Glad to hear that your problem got solved. Apologies for not responding as quickly as I usually would, but you caught me at a bad time (wedding and honeymoon). My wife doesn't let me post about regular expressions while on honeymoon!

Basically, the problems with your data are twofold:
a) There is no unique identifier in the first occurrence of a 'new' number; and
b) The number of repetitions varies.

You cannot use regexp to compare two strings of text and decide if a change has occurred (i.e. a new number/city, whatever). In summary, getting a parser/utility written was a smart move.
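
To give an idea of what such a utility might look like, here is a minimal Python sketch. It is not the tool that was actually written; it simply keeps every bracket line and drops a text line when it repeats the previous text line in the same block. The function name is made up for this example.

```python
def collapse_repeats(lines):
    """Keep bracket lines; drop a text line that repeats the previous one in a block."""
    out = []
    last_text = None  # most recent non-bracket line in the current block
    for line in lines:
        if line.startswith("["):
            out.append(line)          # bracket lines are always kept
        elif line == "":
            out.append(line)          # a blank line ends the block
            last_text = None
        elif line != last_text:
            out.append(line)          # first time this text line is seen
            last_text = line
    return out


text_638 = "???^Los Angeles^60-638^LOS ANGELES CITY USE ONE STEP 1940 LARGE CITY ED FINDER"
sample = ["[19-766]", text_638, "[19-767]", text_638,
          "[19-773]", text_638, "[19-1581]", text_638]

result = collapse_repeats(sample)
print("\n".join(result))
```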

I am writing up a guide about how to use regular expressions, going from basics to more advanced stuff. Stay tuned.

Anonymous said...

Regular Expressions - User guide


http://www.zytrax.com/tech/web/regex.htm

liz said...

thanks, helped me out a bunch :)

fresh332 said...

I have an output file from a program which contains "\n" characters instead of line breaks, e.g.: "Text\nNew line\nAnother line"

Similar to your "mork" solution I do a consecutive replace, first in "normal mode" replacing "\n" characters with something unique like "ZZZ", then in "extended mode" replacing "ZZZ" with "\n" so I finally have the line breaks.

There should be a way to do this in one step, or to automate the two steps, either in notepad++ or with some other tool - has anyone got an idea?

Mark said...

fresh332, yes there is a way to do this in one step; and no, you cannot do this with Notepad++.

I now use a very powerful text editor called XEmacs. It really leaves Notepad++ for dead when it comes to regexp. It's so good that I'm working on a more detailed guide to regexp using XEmacs right now.

FYI: in XEmacs, you specify a newline character by first pressing Ctrl+Q and then Ctrl+J. This creates a newline character that takes care of \n and other "newline" characters.
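
Another tool that does this in a single step is a short script. A minimal Python sketch, using the example string from the question:

```python
# The raw string holds a backslash followed by the letter n, as in the question.
raw = r"Text\nNew line\nAnother line"

# Replace the two-character sequence with a real line break.
fixed = raw.replace("\\n", "\n")

print(fixed)
```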

Dave Bui said...

Brilliant! I love the replacement double blank lines to a single blank lines.

David Leigh said...

I didn't see a mention of Notepad++'s other Find/Replace facility: The TextFX plugin. I did not look to see if any of the "unsolvable" problems would be solved by TextFX, but in the case that they might be, it's worth looking at the TextFX Find/Replace facility (CTRL+R or via the menus) because of the way it can handle newlines and tabs.

That being said, connecting Find/Replace (any flavor) with the macro recording facility of Notepad++ would elevate this software to "perfect" in my eyes...it's the one thing remaining that really aggravates me on a semi-regular basis. Other than that, I LOVE this editor.

Jay Fulton said...

Thank you VERY much. The documentation helps you, of course, but it saves time for the rest of us, too! Much appreciated

Ninad said...

Hi,

Can anyone tell me how to use regexp and convert upper case to lower case?

Mark said...

Good question, Ninad. I'm not sure if regexp can change upper to lower case for you. And I'm not sure how complicated your text file is. However, if you simply want to change text to lower case or vice versa, you can do this without using regexp.

In Notepad++, select the text that you would like to change, then click on the TextFX menu, then TextFX Characters, and then select lower case.

Easy, huh?

Anonymous said...

Hi.

Is it possible to search and replace the following in notepad++?

/*
...
...
*/

I can do it if it's all on one line, e.g. /* ... */

But I can't seem to find the regex command to select across multiple lines.

Is this because n++ regex can't handle line returns?

- J

Mark said...

That's exactly right. The problem is the line returns (or newlines). This is quite problematic isn't it?

If you would like to be able to do these types of regular expressions then you should use a more powerful text editor. I use XEmacs.

I've been working on a very comprehensive "Guide to regexp using XEmacs" post for a while now. Hopefully I will publish it in the next month or two.
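
As an illustration of what a multi-line match looks like in a tool that supports it, here is a minimal Python sketch. The source text is invented, and the pattern does not try to handle comment markers inside string literals.

```python
import re

# Hypothetical source with one multi-line and one single-line comment.
source = "int a = 1;\n/*\n first\n second\n*/\nint b = 2; /* inline */\n"

# re.DOTALL lets "." match newlines; .*? is non-greedy so each match
# stops at the first closing */ rather than the last.
no_comments = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)

print(no_comments)
```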

Jolas Arvin said...

i just search the net for multiline regex replacements and i bumped into this post.

im experiencing same problems on n++. poor thing n++ can't handle multiline regex. :( oh well im looking forward to see the XEmacs guide to regex. hope multiline regex replacement will be included in it. tnx.

i'm somewhat into coding that feature in java to fully customize regex commands into my needs (specially the multiline replacements). :) if anyone did that, please share. many thanks.. :)

Anonymous said...

Can I do a logical-OR regular expression search in Notepad++? In TextPad I used "^Alert|^Error|^Warning" to find all lines in a system log that started with either of the three words. The "|" operator does not seem to work in Notepad++.
Of course, I could do three separate searches, but it would be nice if NotePadd++ did this for me by interpreting an OR operator, e.g. "|".

Mark said...

No, Notepad++ cannot perform logical OR regexp searches. That was an easy question :)

However, the excellent and free XEmacs can handle your search without any problems. Note that your Textpad OR operator | would become \| in XEmacs, i.e.,
^Alert\|^Error\|^Warning
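
For comparison, the same OR search sketched in Python, where the operator is a plain | as in TextPad. The log lines are invented for the example.

```python
import re

# Hypothetical system log.
log = (
    "Alert: disk low\n"
    "Info: ok\n"
    "Error: failed\n"
    "Warning: hot\n"
    "No Error here\n"
)

# ^ anchors to the start of each line (re.MULTILINE); the group lists the
# three alternatives, and .* takes the rest of the matching line.
pattern = re.compile(r"^(?:Alert|Error|Warning).*$", re.MULTILINE)
hits = pattern.findall(log)

print(hits)
```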

Anonymous said...

I have a text file full of blocks of text like this:

"STRING1" =>
{ url => "URL1",
visibleif => sub { !$is_temporarily_terminated &&
padlock("STRING2");
},
},
... more blocks like the above separated by a blank row.

End state: I need an excel file with 3 columns: string1, url1, and string2

Any ideas? I am completely new to regex and using notepad++ for now. If someone who is really good at this replies quickly, then there also could be some work that we could pay them to do in the future as we get a lot of projects like this.

Mark Antoniou said...

That's pretty easy to fix. I wouldn't use Notepad++ for this. Instead use the excellent and free XEmacs.
In XEmacs, the correct regex search term would be (newline character at end of each line is made by Ctrl+Q, Ctrl+J):
"\(.*\)".*
.*"\(.*\)".*
.*
.*"\(.*\)".*
.*
.*

and the correct replace term would be:
\1,\2,\3

This would create the following output:
STRING1,URL1,STRING2
which you could then open in Excel as a comma delimited file, which would place each string/url in a separate column.
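
The same extraction could also be scripted. Below is a minimal Python sketch, assuming every block has exactly the shape shown in the question (a quoted name, a quoted url, then a padlock("...") call); the second block and all the names are placeholders.

```python
import csv
import io
import re

# Two blocks shaped like the one in the question.
text = '''"STRING1" =>
{ url => "URL1",
visibleif => sub { !$is_temporarily_terminated &&
padlock("STRING2");
},
},

"STRING3" =>
{ url => "URL2",
visibleif => sub { !$is_temporarily_terminated &&
padlock("STRING4");
},
},
'''

# Three groups: the block name, the url, and the padlock argument.
# re.DOTALL lets .*? cross line breaks between the url and padlock.
block = re.compile(
    r'"([^"]*)"\s*=>\s*\{\s*url\s*=>\s*"([^"]*)".*?padlock\("([^"]*)"\)',
    re.DOTALL,
)
rows = block.findall(text)

# Write the rows as comma-delimited text that Excel can open.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

If a block were missing its padlock call, the non-greedy .*? would run on into the next block, so the result is worth checking by eye.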

Abhishek said...

Mark,

One question. The contents of file are following.

ABC
XYZ
123

I want to the file contents to be following.

'ABC','XYZ','123'

Thanks,
Abhishek

Mark Antoniou said...

Hey Abhishek. This is an easy task. I would advise that you use XEmacs rather than Notepad++. The reason for this is that Notepad++ does not deal well with newlines.

In XEmacs, you would search for:
\(.*\)
\(.*\)

and replace this with:
\1,\2

Done :)

Abhishek said...

Thanks Mark. But, I work on client network where we cannot install XEmacs. We have only notepad ++ installed. Any other thoughts please?

ABC
123
XYZ

Need to change into 'ABC','123','XYZ'

Mark Antoniou said...

Ok, well there is a way around it, so long as your data is exactly as you have specified here, i.e.:
ABC
XYZ
123

So, in order to get to this:
ABC,XYZ,123
All you need to do is replace the newline character with a comma.

If that is the case, you would use extended search mode and search for: \r\n
and replace this with: ,

That should do the trick.
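
Where a scripting language is available, the quotes can be added in the same pass. A minimal Python sketch using the three lines from the question:

```python
# The file contents from the question, with Windows line endings.
text = "ABC\r\nXYZ\r\n123\r\n"

# Wrap each non-empty line in single quotes and join with commas.
quoted = ",".join(f"'{line}'" for line in text.splitlines() if line)

print(quoted)
```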

Vladimir said...

Hi Mark,

Found your blog and hoping you can help me. I have a batch file that I receive daily. I need some help trying to modify it.

I need to insert a page break before it says PAGENO throughout the whole document. I tried to do Find and Replace with PAGENO & \fPAGENO, but it didn't work. It puts FF in black box in front of PAGENO, but doesn't create a page break when I print. What did I do incorrectly and is this the way to do a page break with regexp?

Also, is there a way to automate this process with Notepad++ or any other app?

Thank you very much for your help!

Mark Antoniou said...

Hey Vladimir,
This was a tough one! Let me begin by saying that I have an answer for you... kind of.

First of all, as far as I am aware, you cannot have page breaks in a text document.

Ok, now that we've got that out of the way, what are we going to do to help you? I would say that inserting a page break requires a rich text editor. So, Notepad++ is not going to cut it.

I have achieved what you requested in one easy step using Microsoft Word. Open your file in Word and select Replace (Ctrl+H), and enter the following search term:

Find what: \fPAGENO
Replace with: ^12

and then hit Replace All.

All of the \fPAGENO are now page breaks. Easy.

If you wish to remove PAGE from the top of each page, you could replace it with nothing. Be sure to match the case when searching so that you do not remove any legitimate occurrences of the word "page" that are in the content of your file (if there are any).

As for automating this, it can be done (although I am not hugely experienced in task automation in Word). Take a look at this URL: http://www.microsoft.com/technet/scriptcenter/resources/qanda/jul07/hey0710.mspx

Good luck. Let me know how it works out.

Puiufly said...

Don't waste time.
move perl.

Mark Antoniou said...

...or you could learn to use Perl, as suggested :)

This is going a bit beyond regexp though!

jp said...

Hi Mark,

I must say its a very useful post.
However i would be very grateful to u if u can solve one of my problems in notepad++.

Input:
{ "arc_on_sf::set_end(...)"
}
25848 0.041144 0.000002 0.1 { "pt_on_sf::evaluate"
}
24408 0.032451 0.000001 0.0 { "pt_on_cv::evaluate"
}

Output: when I place the cursor on any of the open braces and press Ctrl-B in a LISP file (got by using alt-l-l enter) I can see the open bracket and the closed bracket highlighted. Now I need a command to delete the text in between the brackets.


for ex: In the above input if I select { "pt_on_cv::evaluate"
} then it should get deleted upon using a shortcut.


so the final output will be
Output:
{ "arc_on_sf::set_end(...)"
}
25848 0.041144 0.000002 0.1 { "pt_on_sf::evaluate"
}
24408 0.032451 0.000001 0.0

Mark Antoniou said...

Thanks for your question jp.

Some more information would be helpful. As your search involves multiple lines, I would strongly recommend using a more powerful text editor than Notepad++. I use XEmacs on Windows and Aquamacs on OSX. The solutions below will work in any text editor that supports multiline regular expressions (not Notepad++).

If you simply want to remove all instances of curly brackets, and everything that is in between them, you would search for:
{.*
}

Note that in Emacs, the way to insert a newline into your search query is to press Ctrl+Q then Ctrl+J. In the above example, you would insert the newline after the asterisk * and before the close curly braces }

and replace this with nothing.

However, I am assuming that you want to keep some of the information in the curly brackets. From your question, I cannot tell if it is every second instance, or curly brackets that contain "cv". Some more information would allow me to give you a more tailored answer. For the time being, I will assume that you want to remove curly brackets containing "cv", but want to leave those containing "sf" (or anything else) unaffected. To accomplish this, you would search for:
{.*cv.*
}

and replace this with nothing.

sourabh bora said...

Hey Mark,
Awesome blog. I could not make {n} (repeats the previous item n times) work.
Specifically I am looking at deleting a string of 10 numbers.
Thanks

Mark Antoniou said...

Thanks sourabh bora.

Could you copy and paste a sample from your file so that I can have a look at what patterns might work?

Christopher said...

Wow, this guide is very helpful and makes debugging code or even reformatting jumbled scan text from books a snap to clear up.

Always used Notepad++ and these search and replace tips really make things so much easier and faster.

sourabh bora said...

Thanks for your reply.
Here is an example:

Post123456 This is a nice post Post12345678 This is not a nice post
Post324567 This is another nice post

I want to delete the "nice" posts (Post--Followed by exactly 6 numbers, )

Thanks

Mark Antoniou said...

This is actually a lot easier than I thought. If the text preceding the 6 numbers is always the same, then you have an easy way of uniquely identifying the "nice" posts.

Search for:
nice post Post......

Replace with: nothing

This will get rid of the words "nice post Post" and the six characters directly after.

sourabh bora said...

Thanks. Unfortunately, no text in the passage is same. The only pattern is
"Post" followed by 6 and exactly six random digits. There can be "Post" followed by 8 or 9 random digits, but they are of no interest to us.
Example

If you are working on something
Post123456 cool, let #delete this
Post123456789 him know.#dont delete
Post234567 They select a #delete
Post1 forum member#dont delete
Post23 each month for a#dont delete
grant of up to $100 in hardware or software or other products. (Products do not have to be available on the mp3Car Store.)

Mark Antoniou said...

Ok, so I didn't understand your previous message properly, then. It still looks to me that there is a pattern there though.

Search for:
Post......

Replace with: nothing

The problem is that if you search for "Post......" it will replace longer strings too, such as "Post12345678" will become "78", and this is not good. So, in order to make it unique, you might include a space after the final period in your search expression.

I will put the search term in quotes to illustrate that there is a space on the end. Do not use the quotes in your text editor -
Search for: "Post...... "

This search term will leave longer strings of numbers unaffected.

Mark Antoniou said...

Here is the output from your sample of text above:

If you are working on something
cool, let #delete this
Post123456789 him know.#dont delete
They select a #delete
Post1 forum member#dont delete
Post23 each month for a#dont delete
grant of up to $100 in hardware or software or other products. (Products do not have to be available on the mp3Car Store.)

sourabh bora said...

Thanks. This is exactly what I did.
However, regexp has a more elegant solution. You can specify exactly how many characters you are searching for.
What if the number of digits was 60 instead of 6? You can write .{60} instead of typing 60 dots.
I was wondering if notepad has this feature implemented.

And also, we need to search only for digits.. so we will have to type [0-9] sixty times. (otherwise, posting123 will be selected)
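
In a regex flavour that supports counted repetition, "exactly six digits" can be written directly. Here is a minimal Python sketch on the sample text from this thread; it says nothing about what Notepad++ itself supports.

```python
import re

sample = (
    "Post123456 cool, let #delete this\n"
    "Post123456789 him know.#dont delete\n"
    "Post234567 They select a #delete\n"
    "Post1 forum member#dont delete\n"
    "Post23 each month for a#dont delete\n"
)

# \d{6} is exactly six digits; (?!\d) refuses a match if a seventh follows.
# The optional trailing space is removed along with the tag.
result = re.sub(r"Post\d{6}(?!\d) ?", "", sample)

print(result)
```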

marius said...

Hy i am new to regular expression
and i don't quite get it. As i do not want to make a program to replace what i got here, i would like you to help me.

My file is AAABBBCCC etc with all sort of characters from ascii table
the problem is that i want the text ( code ) to be ABC and search for all hex ascii code not just numbers or letters.

Thanx a lot

Mark Antoniou said...

Thanks for your question Marius. I'm just not exactly clear on what you want to do.

To help me, could you provide me with a sample of what your text looks like (a few lines), and then provide me with what you want those lines to look like after you run the regular expression.

marius said...

well my text looks like aaafffcccddd777gggzzziiippp¶¶¶▬▬▬---000▄▄▄

and i would like all the triplets to be replaced with only one character.

As you can see it is not only a to z and A to Z there are all type of characters with code between 0 and 255 ( Ascii code )

Mark Antoniou said...

Ok. If that is all that your file contains, then you could simply search for:
..(.)

and replace with:
\1

Easy.

Note, I don't use Notepad++ any more, since I have moved on to Emacs. In Emacs the search term would be:
..\(.\)
but the concept is exactly the same: Discard the first two occurrences and keep the third.
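
If the file might contain anything other than clean triplets, a backreference inside the search term is a safer variant: it only matches when the same character really occurs three times in a row. A minimal Python sketch using the sample text from the question:

```python
import re

sample = "aaafffcccddd777gggzzziiippp¶¶¶▬▬▬---000▄▄▄"

# (.) captures one character; \1\1 requires the same character twice more.
collapsed = re.sub(r"(.)\1\1", r"\1", sample)

# The simpler ..(.) gives the same result when the text is purely triplets.
simple = re.sub(r"..(.)", r"\1", sample)

print(collapsed)
```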

marius said...

thank you a lot

Mark N said...

I am trying to do 2 things:

1. Find lines with MORE than 95 characters (including white space)

and

2. Find lines with LESS than 95 characters (including white space)

I can do perl regular expressions, but they just don't work for notepad++ for some reason. Can you please help?.

Afzaal Ameer said...

Hey man, as per your wish I have shifted to XEmacs. Now can you please explain the regex to remove multiline comments?

Mark Antoniou said...

Hey Afzaal,
It's very easy with Emacs. You get the newline character by pressing Ctrl+Q Ctrl+J.

For example, if you had two lines and wanted to remove the line break you would

Search for: Ctrl+Q Ctrl+J

Replace with: nothing/leave blank

Mark Antoniou said...

Mark N,
I'm not ignoring you. I've had a bit of trouble getting the regular expression to work in Notepad++. It definitely can be done as a regular expression though.

Must you use Notepad++?

Mark N said...

Well I prefer that it be done in notepad++... besides I don't want to write a script that does this.
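
For reference, the line-length question can be expressed with interval quantifiers in an engine that supports them. The sketch below uses Python's re module as an assumption. Whether a given Notepad++ version accepts the {m,n} syntax is not confirmed anywhere in this thread, so treat this as an illustration of the pattern rather than a Notepad++ recipe.

```python
import re

# Hypothetical test lines of 94, 95 and 96 characters.
lines = ["x" * 94, "x" * 95, "x" * 96]

longer = re.compile(r"^.{96,}$")    # 96 or more characters, i.e. MORE than 95
shorter = re.compile(r"^.{0,94}$")  # at most 94 characters, i.e. LESS than 95

long_lengths = [len(line) for line in lines if longer.match(line)]
short_lengths = [len(line) for line in lines if shorter.match(line)]

print(long_lengths, short_lengths)
```

The dot counts spaces and tabs as characters, which matches the "including white space" requirement.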

Garioch said...

hi, i have a somewhat similar problem ...

i have a sql export-file

i want to "edit" the lines automatically .. coz its almost 6000 of them

each Insert-Line starts with

(id, another_id, third_id, NULL, ...

here i want to "delete" the 3rd id - while leaving all other things

i tried with several search patterns - but to no luck ..

Garioch said...

to be more precise all id , 2nd ID and 3rd ID ar actual numbers

Mark Antoniou said...

Garioch, if you want me to give you the exact answer, paste a few lines of code into a comment. But, the general principle is this:

Group the ids that you want to keep as \1,\2 and don't insert id 3 into the replace term. Make sense?

Garioch said...

4 of the lines of those 6000

(1, 1, 1, NULL, 'delayed billing', '2007-02-16', 0, 17 more fields),
(2, 1, 2, NULL, 'delayed billing', '2007-02-16', 0, 17 more fields),
(3, 1, 3, NULL, 'delayed billing', '2007-03-01', 0, 17 more fields),
(4, 1, 4, NULL, 'delayed billing', '2007-03-01', 0, 17 more fields),

since my question only concerns the start of each line i omitted some info at the end ...

but this should give a picture of the data i want to Replace

until now i was able with some info from other web-pages to find the start of a line with a regex like
[(][0-9]*[, ][0-9]*[, ]

this marks exactly (1, 1, from the first insert-line

so how do i "mark" this as pattern 1 and how do i progress from there

Mark Antoniou said...

Sometimes, the best solution is not to get too fancy. How about if we group everything from the start that you want to keep into \1.

Then we group: Id3, NULL.

Then we group everything from there to the end of the line .*
as \2.

So, your replace term would be: \1NULL\2
That would work.

Garioch said...

thanks mark

but i think i found "my solution"

Find what :", [0-9]*, NULL, "

Replace with : ", NULL, "

then a quick "Replace All"

but again thanks for your advice (from previous answers)
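
The grouping approach discussed above can be sketched as follows. This is a minimal Python illustration (Python's re module is an assumption), with the sample rows shortened by dropping the omitted trailing fields.

```python
import re

# Sample rows from the comment above, shortened for illustration.
rows = [
    "(1, 1, 1, NULL, 'delayed billing', '2007-02-16', 0),",
    "(2, 1, 2, NULL, 'delayed billing', '2007-02-16', 0),",
]

# Group 1 keeps "(id, another_id"; the third id is matched but not captured;
# group 2 keeps everything from ", NULL," to the end of the line.
pattern = re.compile(r"^(\([0-9]+, [0-9]+), [0-9]+(, NULL,.*)$")
cleaned = [pattern.sub(r"\1\2", row) for row in rows]

for row in cleaned:
    print(row)
```

Anchoring at the start of the line means a ", number, NULL, " sequence later in the row would not be touched, which the shorter find-and-replace above cannot guarantee.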

user said...

Hi Mark is it possible to make something like this, im not a programmer so ill try to explain it easy

find any content between two specific custom tags and replace it with the same tags and a new content between them like

find [customtag]*[customtag]

replace [customtag] This is new content replacing whatever was between custom tags.[customtag]

im using * like a wildcard to explain that should select every single character between tags

and more specific what i want is

find *

replace some html marked text like \\Let change some hmtl paragraphs\\
(ive put slashes mixed with html tags because blogger does not allow me to post those tags)

ive read you cannot use regular with multiline so i ask myself if this is possible in notepad++ in some extent and in multiple opened files simultaneously, preferable as i do all my work with this program, and only xemacs as a last option, or alternative if you want to show next to notepad++ that it is easier to accomplish this in xemacs. But i ask myself if xemacs is not for non programmer ppl like (i know html css and more or less can read php and python with a very rough idea of whats going on, sometimes)

thanks again for this super post the best in internet explaining regular expressions for notepad++ and introducing xemacs for the same.

user said...

(blogger screwed my poorly scaped html tags ill try again with parenthesis)

and more specific what i want is

find (<)!--tag1--(>)*(<)!--tag1--(>)

replace (<)!--tag1--(>)some html marked text like (<)div\(>)(<)p\(>)Let change some hmtl paragraphs(<)/p(>)(<)/div(>)(<)!--tag1--(>)

teddan00 said...

if have a filename i.e. a song called "Born To Run-E Street Band-Bruce Springsteen.mp3"

I try to make "E Street Band-Bruce Springsteen" switch place with "Born To Run".

Find: (.*)-(.*)\.
Replace: \2-\1.

But I get the following filename: "Bruce Springsteen-Born To Run-E Street Band.mp3"

It seems that the last occurrence of "-" is found. Is it possible to find the first occurrence, AND still make it compatible with filenames that only have one "-" in its filename?

TechnologyYogi said...

I used NP++'s regular expressions for find and replace for the first time - successfully, before this I depended on MS SQL Server's Management studio for this, as it has very cool easy to use find/replace features (using regular expressions).

Thanks for the post!

Mark Antoniou said...

First of all, apologies for taking so long to respond. I was on holidays overseas and only recently arrived back in Sydney.

teddan00, I will answer your question first because it is an easy one. If the character "-" is giving you trouble, simply change it to something else via a simple Find+Replace. For instance,

Search for: -
Replace with: mork

Now, run a regular expression like this

Search for: (.*)mork(.*)mork(.*).mp3
Replace with: \2-\3-\1.mp3

For songs with only one "-",

Search for: (.*)mork(.*).mp3
Replace with: \2-\1.mp3

Easy.
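
An alternative that avoids the temporary "mork" substitution is to use a character class that cannot cross a hyphen, so the first group always stops at the first "-". The sketch below uses Python's re module (an assumption), and the helper name swap_title is hypothetical.

```python
import re

def swap_title(filename: str) -> str:
    """Move the text before the FIRST hyphen to the end of the name."""
    # [^-]* cannot match a hyphen, so group 1 ends at the first "-";
    # group 2 takes the rest, whether it contains another hyphen or not.
    return re.sub(r"^([^-]*)-(.*)\.mp3$", r"\2-\1.mp3", filename)

three_parts = swap_title("Born To Run-E Street Band-Bruce Springsteen.mp3")
two_parts = swap_title("Born To Run-Bruce Springsteen.mp3")

print(three_parts)
print(two_parts)
```

One pattern covers both the two-hyphen and the one-hyphen filenames.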

Mark Antoniou said...

@user
Thanks for your question. The short answer is "yes", that is exactly what regexp is for.

I couldn't understand your second post, so I will do my best to answer your first post.

Let's say that you had two custom tags and wanted to replace the text between them.

Find: ([customtag1]).*([customtag2])

Replace: \1Type replacement text here\2

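The \1 and \2 will re-insert custom tags 1 and 2, respectively, back into the text file. Note that square brackets normally define a character class in a regular expression, so if your tags literally contain [ and ], escape them as \[ and \].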

Hope I understood and answered your question.
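
A minimal sketch of the tag replacement in Python's re module (an assumption) is given here. The tag text is taken from the follow-up question, and the replacement content is hypothetical. re.escape makes any special characters in the tag literal, and .*? keeps the match from running on to a later tag.

```python
import re

# Tag taken from the follow-up question; the surrounding text is hypothetical.
tag = "<!--tag1-->"
text = "before <!--tag1-->old\ncontent<!--tag1--> after"

# re.escape neutralises special characters in the tag.
# re.DOTALL lets the dot cross line breaks; .*? stops at the nearest closing tag.
pattern = re.compile(
    "(" + re.escape(tag) + ").*?(" + re.escape(tag) + ")",
    re.DOTALL,
)
result = pattern.sub(r"\1New content\2", text)

print(result)
```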

Der Bloggende Nomade said...

From now on (5.7.1) it's possible to record search and replace events within a macro.

Tiberius Gracchus said...

There's a very simple workaround for searching multiple lines. Replace \r\n with something that is never present naturally. I like the ANSI character 167, but Notepad doesn't have a facility for inserting ANSI characters easily.

Anyway then you run your search specifying the character or string as your endline equivalent, go to town and replace the puppies with \r\n.

Mark Antoniou said...

Clever workaround. I like it. However, this doesn't address the main reason that forced me to move from Notepad++ to Emacs:
By using a more powerful text editor, workarounds are not required. New line characters can be searched for and/or replaced at will. This simplifies the search and replace expressions and saves me time.

Luc said...

Thank you for the guide! I'm a little confused about what to do in my situation.
I have a file with a structure like this:
BEGIN:VCARD
VERSION:2.1
N:Doe;John;;;
FN:John Doe
TEL;CELL;PREF:+41800800800
EMAIL;PREF;WORK:test@blabla.com
ORG:Test
END:VCARD

I want the "FN:" section to be changed in that way: FN: Doe, John (and no more FN: John Doe). Is that possible?

Mark Antoniou said...

Thanks for your question, Luc. Here's the Notepad++ solution:

Search for: (FN:)(.*) (.*)
Replace with: \1 \3, \2

Note that this expression assumes that people only have two names.
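
The same swap can be tried in Python's re module (an assumption) on a shortened version of the vCard. Using \S+ instead of .* is an added tightening, so a line with three or more names is left unchanged rather than being split in the wrong place.

```python
import re

# Shortened vCard from the question above.
vcard = "N:Doe;John;;;\nFN:John Doe\nORG:Test"

# re.MULTILINE makes ^ and $ anchor to each line, so only the FN: line matches.
swapped = re.sub(r"^(FN:)(\S+) (\S+)$", r"\1 \3, \2", vcard, flags=re.MULTILINE)

print(swapped)
```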

Edward said...

Hi,

Is there a way for notepad++ to do an
"or" operation? Something like:

find A or B or C

I would especially like this for when
I do a find of all in current document.

Thanks, Ed

Pushkar said...

Hi Mark,

Thanks for the wonderful article, but I still couldn't resolve one of my problems. Could you please tell me how to replace "@#$%" with <@#$%>. Thank you. :) keep up the good work

Shamik said...

Awesome post...kudos for the great work

Mark Antoniou said...

Apologies for taking so long to respond. You all caught me in the midst of a trans-continental move. Now, to your questions:

@Edward: To my knowledge, no.

@Pushkar: Do you literally mean replacing @#$% with <@#$%>? This can be achieved using a simple Find + Replace:

Find: @#$%
Replace with: <@#$%>

If you are talking about some sort of larger-scale find and replace based on some criterion, you need to give me more information, and preferably a snippet of text showing what the text looks like before and what you would like it to look like after.

@Shamik: Glad you liked it :)

Martin said...

to answer the question, "is there anything it can't do"
well look ahead and look behind in regexp fails, and newlines (pretty much anything supported in extended) isn't supported in regexp.
and in case any one is wondering, yes vim supports this just fine.
but I'm still in love with notepad++ because it's just so much more simple to use, but learning vim is still well worth the effort (in my 1st week now and starting to get some real work done with it xD)

but who knows, maybe these issues will get addressed in the next version of notepad++

anyway nice article it did help a little even for an issue that couldn't be fixed in notepad++ xD

e22 said...

If you want to use Notepad++ to do regex over multiple lines simply start off by replacing \r\n with something like !NEWLINE! using the extended settings then do the reverse when finished!

Mark Antoniou said...

Yes, e22, that is what I did in the original post above, though I used a nonsense word "mork" rather than !NEWLINE!

Still though, it is quite unacceptable to me that three steps are required rather than one. And once you start using very complex regular expressions in text files that are hundreds of thousands of lines long, it becomes very tedious to have to worry about whether you missed any of your newly inserted !NEWLINE!s, or if any subsequent expressions modified something in your nonsense word (e.g., if I then got rid of all exclamation marks, it would be hard to go back). My point is that regular expressions are meant to save you time...

Shikhar Kumar said...

nice article, got my work done.

Nico said...

Hello, nice guide.
I have a (newbie) question:
I have the following text:

Minradio#23-567

The result that I want is:

23567

What should be my regexp?

Thanks

Mark Antoniou said...

This is quite a straightforward example, Nico. Haven't had one of these in a while ;)

So we start off with this:
Minradio#23-567

In Notepad++ regular expression search mode,

Search for: .*#(.*)-(.*)
Replace with: \1\2

What you end up with is this:
23567

It might seem a little tricky, but the concept is simple: What information do you want to keep? And how does the other unimportant information border it? In the regexp above, I used the hash (#) and hyphen (-) as anchors. This means that:
a) the text before the hash is free to vary
b) the number of digits between the hash and hyphen is free to vary
c) the number of digits after the hyphen is free to vary.

The limitation is that if some of your lines of text do not contain # or - then it will break my regexp.
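
For anyone who wants to test the pattern outside Notepad++, here is a minimal sketch in Python's re module (an assumption). The second, stricter pattern is an added variant that only accepts digits around the hyphen.

```python
import re

line = "Minradio#23-567"

# Pattern from the reply: discard everything up to "#", keep both sides of "-".
digits = re.sub(r".*#(.*)-(.*)", r"\1\2", line)

# Stricter variant (an assumption): \d+ accepts digits only.
strict_digits = re.sub(r"^.*#(\d+)-(\d+)$", r"\1\2", line)

# A line without "#" or "-" is simply left unchanged by both patterns.
untouched = re.sub(r".*#(.*)-(.*)", r"\1\2", "no anchors here")

print(digits, strict_digits, untouched)
```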

Nico said...

Hey Mark, thanks for your help.
Almost worked!!

The result that I've got is

16-103

The "-" was not removed.
Any clue?

Mark Antoniou said...

Make sure that the hyphen is not enclosed within the parentheses.

Nico said...

Hi. Sorry to bother you with this lame question.

That's my string:



What is the "\1\2" that you said to use as replacement?

The "-" never goes away :-/

Mark Antoniou said...

Ok, let's back up a bit. Your original text is this:
Minradio#23-567

You want to keep the numbers, and get rid of whatever is before the numbers as well as the hyphen. So, in Notepad++ regular expression search mode,

Search for: .*#(.*)-(.*)

Let me break down this search term. The first three characters .*# will search for anything until a hash # is found (Minradio# in the above example). We don't put parentheses around this because we don't want to use it in our Replace term; we simply discard it. The next five characters (.*)- will search for anything until a hyphen - is found. The parentheses around the period and asterisk mean that that text (which is in this instance the text immediately after the hash #, that is, the number 23) can be recalled in our Replace term. The way to recall the contents of this first set of parentheses is by typing \1. The hyphen is not enclosed within the parentheses and therefore cannot be recalled in the Replace term; it is simply discarded. Finally, the last four characters (.*) select the remaining text (in this example 567) and the parentheses mean that it can be recalled in the Replace term, this time by \2, because it is the second set of parentheses. So, the Replace term looks like this:

Replace with: \1\2

What you end up with is this:
23567

So, why are you ending up with 23-567? There are a few possibilities:

1. The original text had two hyphens:
Minradio#23--567
If that is the case change your search term to this:
.*#(.*)--(.*)

2. You are including the hyphen within one of the sets of parentheses:
.*#(.*-)(.*)
or
.*#(.*)(-.*)
The hyphen therefore will not be discarded. It will be recalled when you use \1 (top) or \2 (bottom).

3. You are reinserting the hyphen in your Replace term:
Replace with: \1-\2

prozaker said...

you could take a look at the pythonscript plugin, it has a python replace method that everyone could use. It looks complete, textfx or regular n++ regular expression lack options.

http://sourceforge.net/projects/npppythonscript/
--------
editor.pyreplace('id\=\"A\d+\" ','') # delete all id="A##"
------------

el Mauri said...

Hello, nice guide.
I have a (newbie) question:
I have the following list of emails:

aqc25-8@hotmail.com, aoro00@hotmail.com, frojasd08_hotmail.com ... and the list so on

And I want to pick out the emails that do not comply with the format of a regular email; in my example:

frojasd08_hotmail.com (it hasn't the character @)

Can you help me with the correct regular express to find this pattern?

Thanks, Mauri

Mark Antoniou said...

Mauri, it turns out that this is not as trivial as it first appears. Handling email addresses is quite a controversial issue in the regexp world. See http://www.regular-expressions.info/email.html for a discussion of the various issues and disagreements. Your sample text has two unique characteristics that allow us to sidestep the messy world of identifying 'what is an email address?', so I have taken advantage of these two unique conditions:
1. Each email is separated by a comma followed by a space ", "
2. Some of the email addresses are missing a "@"
2. Some of the email addresses are missing a "@"

I have written the solution below for Notepad++. It involves several steps, but as long as conditions 1 and 2 from above are satisfied, it will always work.

So, we start with this:
aqc25-8@hotmail.com, aoro00@hotmail.com, frojasd08_hotmail.com, sam@spouts.com, steve#yahoo.com, ken@jeff.net

Step 1: Place each email address on its own line
Search for (Extended mode): ", " (without the quotation marks)
Replace with: ,\n

You end up with this:
aqc25-8@hotmail.com,
aoro00@hotmail.com,
frojasd08_hotmail.com,
sam@spouts.com,
steve#yahoo.com,
ken@jeff.net

Step 2: Remove correctly formatted emails that contain "@"
Search for (Regular expression mode): .*@.*
Replace with: (nothing, leave blank)

You end up with this:


frojasd08_hotmail.com,

steve#yahoo.com,


Step 3: Remove blank lines
Search for (Extended mode): \n
Replace with: (nothing, leave blank)

The result is this:
frojasd08_hotmail.com,steve#yahoo.com,

Optional step 4: If desired, you could at this point insert a space after each comma
Search for: ,
Replace with: ", " (without quotes)

End result:
frojasd08_hotmail.com, steve#yahoo.com,

So, only those email addresses that do not contain the @ are left, and they may now be corrected, logged, or whatever.
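
The steps above can also be sketched in a few lines of Python. This is an illustration under the same two conditions (addresses separated by ", " and bad addresses lacking "@"), and it uses plain string operations rather than a regular expression.

```python
emails = (
    "aqc25-8@hotmail.com, aoro00@hotmail.com, frojasd08_hotmail.com, "
    "sam@spouts.com, steve#yahoo.com, ken@jeff.net"
)

# Step 1 equivalent: one address per list item.
addresses = emails.split(", ")

# Steps 2 and 3 equivalent: keep only the addresses that do not contain "@".
malformed = [address for address in addresses if "@" not in address]

# Optional step 4 equivalent: join them back with a comma and a space.
result = ", ".join(malformed)

print(result)
```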

BK said...

I need your help.

I built a reg expression using regmagic tool.

The expression is:

\b(?:(?:[1-9][0-9]{1,3}|[5-9])[0-9]{4}|[0-9]+|[0-9]+)\b

This expression supposed to find numbers between

50000
and
99999999

Here is a sample line that I use to test this regex in notepad++

appstore.gearlive.com/member/76234/|0

I have 1000 lines like this. But despite I check, regular expression as the searchmode, it finds nothing.

What am I missing. Please Help!

Mark Antoniou said...

Yeah, BK, it's not going to happen. Not with Notepad++, at least. From past experience, Notepad++ has problems both with repetition such as {1} and with the word-boundary anchor \b.

You could achieve what you want to do in four (fairly inelegant) steps, starting from the largest number of digits:
[1-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]

and removing one digit at each step
[1-9][0-9][0-9][0-9][0-9][0-9][0-9]

then
[1-9][0-9][0-9][0-9][0-9][0-9]

and so on, until you arrive here
[5-9][0-9][0-9][0-9][0-9]
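
In an engine that does support repetition counts, the range 50000 to 99999999 fits in one pattern. The sketch below uses Python's re module (an assumption). The pattern is my own reconstruction of the intent, not the RegexMagic output quoted above, and the helper name first_in_range is hypothetical.

```python
import re

# Five digits starting with 5-9 (50000-99999), or six to eight digits with no
# leading zero (100000-99999999). The lookarounds reject longer digit runs.
in_range = re.compile(r"(?<![0-9])(?:[5-9][0-9]{4}|[1-9][0-9]{5,7})(?![0-9])")

def first_in_range(line):
    """Return the first number between 50000 and 99999999, or None."""
    match = in_range.search(line)
    return match.group(0) if match else None

lines = [
    "appstore.gearlive.com/member/76234/|0",
    "appstore.gearlive.com/member/49999/|0",
    "appstore.gearlive.com/member/99999999/|0",
    "appstore.gearlive.com/member/100000000/|0",
]
found = [first_in_range(line) for line in lines]

print(found)
```

The first line is the sample from the question. The other three are hypothetical edge cases just below, at the top of, and just above the range.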

warm up said...

@Mark,

[5-9][0-9][0-9][0-9][0-9]
this indeed finds what I want. Thank you very much.

Wavetrain said...

Hey, just wanted to say thanks for the pointers. Really helped me clean up a massive wiki list, it probably cut down editing time to 1/4 what it would have been.

warm up said...

I need a regex builder. Will you please suggest me a good one?

Thanks in advance.

Mark Antoniou said...

Sorry warm up, I've never used one, and definitely couldn't recommend a good one. If you've got a specific regexp query I might be of more use.

warm up said...

Thanks for offering your help and your time.

I want to find a string in a text like this;

For example, every line in the text file has the string

#links#
and after this string there are several words that do not interest me. I want to find and mark #links# and the words after that so that I can delete them. How can I do that with Notepad++?

Mark Antoniou said...

That's not too difficult. You just need to identify #links# and everything after it on each line, and delete it.

Search for: #links#.*
Replace with: nothing, just leave it blank

warm up said...

But I do not want to delete only #links#. I want to delete #links# and the words that come after it.

For example, let's say I have a line like this:

something is important but #links# this is not

after the process I want to get only;

something is important but

Mark Antoniou said...

Yep, I understood what you were after. This still works. Let me break it down for you:

Make sure that you are searching in Regular Expression mode
Search for: #links#.*
Replace with: nothing, just leave it blank

Note that #links# is followed by a period and asterisk .* which will select everything after #links# until the end of the line.

So, when you use that term on this:
something is important but #links# this is not

What will be left over is this:
something is important but
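
The same search term behaves identically in Python's re module (used here as an assumption for illustration). Note that the space before #links# is not part of the match, so it remains at the end of the line unless it is stripped.

```python
import re

line = "something is important but #links# this is not"

# .* runs from #links# to the end of the line, so the whole tail is removed.
kept = re.sub(r"#links#.*", "", line)

# Optional tidy-up: drop the trailing space left in front of #links#.
kept_trimmed = kept.rstrip()

print(repr(kept), repr(kept_trimmed))
```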

warm up said...

Ok. That works.

Thank you very much.

Constantin said...

Searching for multiple lines doesn't seem to be working.

I am searching for this
@Text.*\r\n.*;

Eclipse has no problem finding it...

Any ideas ?

Mark Antoniou said...

If you are trying to perform this search in Notepad++, it's not going to happen.

Having said that, if you insist on using Notepad++, you are going to need to get creative and will need to break the search down into steps because \r\n cannot be used in Regular Expression mode - you need to use Extended Search mode for \r\n. So, how many steps do you need? I'm not sure, because it depends on your text, but my guess is at least three:

1. Turn the newline into something unique.
2. Run the regexp.
3. Put the newlines back or do something else with them (not sure what, because you didn't specify).
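
The three steps above can be sketched as follows in Python (an assumption; the placeholder token and the sample text are hypothetical). The last lines show the one-step search that an engine with newline support allows, for comparison.

```python
import re

PLACEHOLDER = "!NEWLINE!"  # hypothetical token; it must not occur in the text
text = '@Text("a")\r\nprivate String a;\r\nint b;'

# 1. Turn the newline into something unique.
flat = text.replace("\r\n", PLACEHOLDER)

# 2. Run the regexp against the single long line.
flat = re.sub(r"@Text.*?" + re.escape(PLACEHOLDER) + r".*?;", "REPLACED;", flat)

# 3. Put the newlines back.
restored = flat.replace(PLACEHOLDER, "\r\n")

# For comparison: the original one-step search works where \r\n is supported.
direct = re.search(r"@Text.*\r\n.*;", text)
direct_match = direct.group(0) if direct else None

print(repr(restored), repr(direct_match))
```

Steps 1 and 3 modify the text, which is why this workaround does not suit a search-only pass over a code base.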

Constantin said...

Well :), if the regular expression implementation in Notepad++ would implement the multi line pattern that could solve it. I am not familiar with how Notepad++ is implemented but Java would allow multi line patterns. I bet .NET would do the same.

The solution you suggested would work nicely but what I was trying to do was to search thru a large set of java files for a certain multi line pattern. So I can't have the option to replace the \r\n with a special token since that will alter the code base.

Thanks for looking!

Mark Antoniou said...

If you do not *have* to use Notepad++, why not just use a more powerful text editor (XEmacs), which will give you the one-line solution that you are looking for?

Dee said...

Hi Mark,
Thanks for the very helpful blog post.. at first I couldn't quite figure out how to use the replace field dynamically.. I had a situation like this:

Text:

Step 1
Step 2
Step 3


Find : Step\s\d
Replace : Step\s\d|

which of course gave me this!

Step\s\d|
Step\s\d|
Step\s\d|

Eventually, it clicked that \1 represents the text captured by the first set of parentheses
and I stumbled upon this:

Find: (Step\s\d)
Replace: \1|

which gave me the desired result:

Step 1|
Step 2|
Step 3|

Just wanted to get that out there in case anyone else is struggling with that.

Once again cheers Mark for the help on that one.. the fist in the air celebration was priceless.

Dee

Mark Antoniou said...

Glad that you found the post helpful, Dee. It's funny how it seems so much simpler after you have that "aha" moment!

Manuel said...

hi,,
i need to do a massive replacement from:

tcp10102/172.20.225.246_PROBE

to


tcp10102_PROBE

can you tell me the syntax to use for this replacement?
text after tcp and text after/ and before _PROBE varies..

Mark Antoniou said...

Hey Manuel,
This is pretty straightforward. You want to keep everything before the forward slash and everything after the underscore.

In Regular Expression search mode,
Search for: (.*)/.*_(.*)
Replace with: \1_\2

You can see that in the search term, I am using the forward slash and underscore as signposts, and am keeping everything before and after (enclosed in parentheses), but am discarding everything in between (not enclosed in parentheses).
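
A quick check of this search and replace in Python's re module (an assumption). The second input is a hypothetical name, included to show that the text around the signposts is free to vary.

```python
import re

# Keep the text before "/" and after "_"; discard what lies between them.
pattern = re.compile(r"(.*)/.*_(.*)")

short = pattern.sub(r"\1_\2", "tcp10102/172.20.225.246_PROBE")
other = pattern.sub(r"\1_\2", "udp53/10.0.0.1_CHECK")  # hypothetical second name

print(short, other)
```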

Nate said...

I am interested in searching a document and replacing everything from a href=" to " and change all the links quickly with notepad ++ can you tell me how to do this?

I tried searching for ahref=".*" and it selected everything up to the LAST "

Please advise!

Thanks

Mark Antoniou said...

Ok, I'm not sure exactly what you want the end result to be, but I'll give it a go. Say that you start with something like this:

ahref="www.google.com"
ahref="www.facebook.com"
ahref="www.blogger.com"
ahref="www.twitter.com"

If you want to keep the ahref=" and the final " you could

Search for (regexp mode): (ahref=").*(")
Replace with: \1\2

The end result would be

ahref=""
ahref=""
ahref=""
ahref=""

If you want to keep everything but the ahref=" and the final " you could

Search for (regexp mode): ahref="(.*)"
Replace with: \1

The end result would be

www.google.com
www.facebook.com
www.blogger.com
www.twitter.com

If you want to do something else, you're going to have to be more specific. Ideally, show me what a few lines of text look like before, and what you want them to look like after.
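
The "selected everything up to the LAST quote" behaviour comes from .* being greedy. One common remedy, sketched here in Python's re module (an assumption), is to replace the dot with [^"], which cannot run past a quotation mark. The single-line sample and the www.example.com replacement are hypothetical.

```python
import re

# Hypothetical line with two links on it.
html = 'ahref="www.google.com" ahref="www.facebook.com"'

# Greedy: .* runs to the last quotation mark on the line.
greedy = re.findall(r'ahref=".*"', html)

# Bounded: [^"]* stops at the first closing quotation mark.
bounded = re.findall(r'ahref="[^"]*"', html)

# Rewriting every link while keeping ahref=" and the closing quote.
rewritten = re.sub(r'(ahref=")[^"]*(")', r"\1www.example.com\2", html)

print(greedy, bounded, rewritten)
```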

Nate said...

Awesome! Thanks for the quick reply, worked great! What an awesome trick for rewriting!

Rakesh Juyal said...

Mark, is it possible to replace all ? in any text file with '${abc' then an incrementing number then '}$'
example:
--------------
where ( col1 = ? or col1 = ? ) and col2 = ?

replaced to

where ( col1 = ${abc1}$ or col1 = ${abc2}$ ) and col2 = ${abc3}$
----------------

Mark Antoniou said...

Yes it is. But it will require a very long and convoluted process and several search and replace steps (similar to the blog post above). The problem is the "increment by one" part.

In Notepad++, you can insert incremented numbers from the Edit | Column Editor menu command. This places numbers at the front of each line.

You could possibly position each ? so that it occurs at the end of each line, then replace it with ${abc\1}$, where \1 represents the number at the beginning of the line.

Not sure if you want to go ahead with this, but if you do, here are the steps:
1. Get rid of all line breaks, replacing them with some unique string that does not occur in your original text file, such as "thereisnoothertextlikethis".
2. Search for ? and replace with a ? followed by a linebreak.
3. Add numbers to the beginning of each line using the Edit | Column Editor menu command.
4. Use a regular expression to search for the number at the beginning of each line and move it to ${abc\1}$
5. Remove all linebreaks.
6. Replace all instances of thereisnoothertextlikethis to restore your original linebreak structure.

If you want to go ahead with this, paste a larger portion of your text file (10-20 lines) and I'll show you how to do it in more detail.
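
If a scripting language is an option, the "increment by one" part becomes straightforward. The minimal Python sketch below is an alternative to the Notepad++ steps above, not a description of them. It passes a function to re.sub so each ? receives the next number, and the helper name numbered is hypothetical.

```python
import itertools
import re

sql = "where ( col1 = ? or col1 = ? ) and col2 = ?"

# itertools.count(1) yields 1, 2, 3, ... one value per call to next().
counter = itertools.count(1)

def numbered(match):
    """Return the replacement for one "?" using the next counter value."""
    return "${abc%d}$" % next(counter)

# \? matches a literal question mark; re.sub calls numbered() for each match.
result = re.sub(r"\?", numbered, sql)

print(result)
```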

איתי גודאי said...

Hello,

Thank you for the time investing publishing and answering - Helped me a lot ...My Question is :

* If I have Emails with NOT similar text before and After and I would like to extract those Emails...for example :

In the same Text :
First String :
===============
"21-Feb-2011 12:16:49 GMT+02:00 PM","alternateContactBusinessPhone":"","databasePlatform":"","productLine":"Oracle E-Business Suite","lastPublicActivityCreatedBy":"JON.WHELAN@ORACLE.COM","accountStatus":"Active","commitTime":"22-Feb-2011 9:09:06 GMT+02:00 AM","HWCity":"","conflictId":"0","outageType":"","contactLogin":"GOREN.NAAMA@GMAIL.COM","subCategory":"","SRContactEmail":"goren.naama@gmail.com","alertMe":"false","SRContactPhone":"(972) 542-1341 x76"

Second String :
===============
"09-Jun-2011 10:42:57 GMT+03:00 AM","alternateContactBusinessPhone":"","databasePlatform":"","productLine":"Oracle Database Products","lastPublicActivityCreatedBy":"MERCEDES.PORRAS@ORACLE.COM","accountStatus":"Active","commitTime":"10-Jun-2011 10:20:52 GMT+03:00 AM","HWCity":"","conflictId":"0","outageType":"","contactLogin":"ITSHAK@HADASSAH.ORG.IL","subCategory":"","SRContactEmail":"itshak@hadassah.org.il","alertMe":"false","SRContactPhone":"02-6778113"

Regards
Etay G

Mark Antoniou said...

Glad you have found the blog useful, Etay G. I'm not sure exactly what you are trying to get from the text. Do you want to get rid of everything, leaving only the email addresses?

RatA said...

Mark, thanks for the post, it is very useful. Following the first example, how about not erasing the whole line, but only a part?

like i want to remove the $_POST['abc']; part in all lines

$abc = $_POST['abc'];
$bbb = $_POST['def'];

i try [$_POST].* but it erase all the line, and not the final part.

Mark Antoniou said...

RatA, if I understood correctly, you want to turn this:
$abc = $_POST['abc'];
$bbb = $_POST['def'];

into this:
$abc =
$bbb =

is that right?

To do this,
Search for (regular expression mode): $_POST.*
Replace with: nothing
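
One caveat when trying this in other regex engines: $ normally means "end of line", so it has to be escaped to match a literal dollar sign. A minimal sketch in Python's re module (an assumption) follows.

```python
import re

code = "$abc = $_POST['abc'];\n$bbb = $_POST['def'];"

# \$ matches a literal dollar sign; an unescaped $ would be an end-of-line anchor.
stripped = re.sub(r"\$_POST.*", "", code)

print(repr(stripped))
```

The dot does not cross the line break, so each line is trimmed independently and the "$abc = " part is kept.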

RatA said...

thanks, u are a genius.

Mikazza said...

Hi Mark,

Thanks for all the great info on regular expressions, although I have a problem I can't seem to find the solution for.

I have a data file from which I would like to strip out some sections as they are useless. First I replaced all the \r\n with @NEWLINE@ so I could get the whole file in one line; now I'm trying to replace anything between message and /message with deleted.

e.g.

**Data I want to keep is here 1**
message
called today but nobody was home
/message
**Data I want to keep is here 2**
message
called today but nobody answered
/message

the words message have < and > around them but the site wont let me post them.

As I said I removed all the line breaks from this and tried to run this regular expression.

Find: (messages.*)(/messages)
Replace: deleted

I couldn't work out how to find the < or > symbols.

I hoped this would delete all the messages and replace them with the word deleted. What it does, though, is find the 1st occurrence of the word messages, then find the last occurrence, and replace everything in between with the word deleted. In my example above it's deleting **Data I want to keep is here 2**

Is there any way of doing this using regular expressions?

Mark Antoniou said...

Hi Mikazza, I am not sure that I have understood exactly what you are trying to do, but will give it a shot. So this is your original text:

**Data I want to keep is here 1**
< message >

called today but nobody was home
< /message >

**Data I want to keep is here 2**
< message >

called today but nobody answered
< /message >


In order to remove the < message > and < /message > tags, you should

Search for (regular expression mode): <.*>
Replace with: nothing

This will give you this:

**Data I want to keep is here 1**


called today but nobody was home


**Data I want to keep is here 2**


called today but nobody answered


If you then want to get rid of the lines that begin with "called", you could

Search for (regular expression mode): called.*
Replace with: nothing

which will give you this:

**Data I want to keep is here 1**





**Data I want to keep is here 2**




And then fix the blank lines as you see fit. Hope this helps.

p.s. I inserted spaces before and after the greater and less than symbols so that they would show up in the post. You would not include the spaces in the search term.

Mikazza said...

Thanks for the quick response Mark, what I want to replace is the < message > and < /message > and everything in between them. I can get it to work if there is only one set of these tags in the file (unfortunately there are thousands), if there are more than 1 set it goes wrong and deletes everything between the 1st < message > and the last < /message >.

Since the < message > and < /message > are on different lines in the file and the content between them can also vary on how many lines it's over, I removed all the line breaks to make it a bit easier to do the search and replace.

Please let me know if you need any more information.

Thanks!

Mark Antoniou said...

Ok got it. So, you start off with this:

**Data I want to keep is here 1**
< message >

called today but nobody was home
< /message >

**Data I want to keep is here 2**
< message >

called today but nobody answered
< /message >

Notepad++ has a hard time handling multiline regular expressions. One option is to use a different text editor with more powerful regexp capabilities (ahem, Emacs). The other option is to use Notepad++ and break this down into a few steps (3 to be precise).

Step 1: Remove the newlines

Search for (extended mode): \r\n
Replace with: nothing

This will give you this:
**Data I want to keep is here 1**< message >called today but nobody was home< /message >**Data I want to keep is here 2**< message >called today but nobody answered< /message >

Step 2: Make all instances of < /message > occur at the end of a line. The reason for this is because we want to discard everything before < /message >, apart from that bit at the front that we want to keep.

Search for (extended mode): < /message >
Replace with: \r\n

So, your text will now look like this:
**Data I want to keep is here 1**< message >called today but nobody was home
**Data I want to keep is here 2**< message >called today but nobody answered

We are nearly there, but we still want to discard everything after (and including) the < message > tag.

Step 3: Remove everything from < message > onwards.

Search for (regular expression mode): (.*)< message >.*
Replace with: \1

And finally, we arrive at our desired result:
**Data I want to keep is here 1**
**Data I want to keep is here 2**

Menes said...

Hi Mark, I have text like this:

apple(7)orange(27)banana(318)tulip(2)

And I want to convert it to this:
apple,orange,banana,tulip

I tried these:
[(].*[)] and (\(.*)())
but neither of them works.
Thanks for helping

Mark Antoniou said...

Hi Menes,

This is a little tricky. The reason is that there are multiple pairs of parentheses on the same line, so a search term like \(.*\) will match everything from the first open parenthesis to the last closed one. First things first, the way to search for parentheses is with a preceding backslash, like this \( for open and this \) for closed.

One solution for your problem is to take a different approach: rather than trying to take care of all parentheses at once, you could take care of parentheses that contain the same number of digits.

Search for (regular expression mode): \(.\)
Replace with: ,

Search for (regular expression mode): \(..\)
Replace with: ,

Search for (regular expression mode): \(...\)
Replace with: ,

Which will give you this:
apple,orange,banana,tulip,

This, of course, becomes impractical if the numbers within the parentheses range from 1 to 100 digits long. But, as a quick fix, it should be fine for your problem.
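
Outside Notepad++, the varying number of digits is less of a problem. Here is a rough Python sketch, assuming the parentheses only ever contain digits. The \d+ matches one or more digits, so a single pass covers every length, and the leftover comma at the end is trimmed afterwards.

```python
import re

text = "apple(7)orange(27)banana(318)tulip(2)"

# \( and \) are literal parentheses; \d+ is "one or more digits".
with_commas = re.sub(r"\(\d+\)", ",", text)

# The last item also had a number after it, so drop the trailing comma.
result = with_commas.rstrip(",")
print(result)
```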

Sam said...

hi,
alter database rename file '/fs-a01-a/databases/inv1cn/aggregate_idx-42.dbf' to '/fs-a01-c/databases/inv1cn/aggregate_idx-42.dbf

i want to change this to
alter database rename file '/fs-a01-a/databases/inv1cn/aggregate_idx-42.dbf' to '/fs-a01-c/databases/inv1cn/aggregate_idx-42.dbf';

At the end I have to add ';

is this possible ?

Mark Antoniou said...

Hi Sam,
if I understand you correctly, all that you want to do is add an apostrophe and semicolon to the end of the text. This is easily accomplished.

Search for (regexp mode): (.*)
Replace with: \1';

Sam said...

Thanks Mark, it's working

Adrian981 said...

Hi, your guide is amazing.
I was hoping you could help me with a problem. Here it is: I have one big long line of full names and phone numbers, e.g. john cruz 00374653 kelly brunz 95847364 alan whirtz 9898372 jane doerl and so on.
I'm trying to get it like this
John cruz 00374653
kelly brunz 95847364
alan whirtz 9898372

P.S. I've tried to look up how to replace every 3rd space with a line break or something along those lines.

Please any help is welcome.
Adrian

Mark Antoniou said...

Glad you found the post helpful, Adrian. I really like your example, because it seems to be very difficult to find a pattern in this seemingly unpredictable series of names and numbers. Some people might have three names (or in the case of Madonna, Pele and so forth, just one), so looking for the third space is not a very foolproof solution. In addition, you could perhaps use the number of digits in a phone number, but this isn't foolproof either as area codes vary in length, as do country codes and so on. You need to think outside the box in order to solve this particular expression. The solution is actually so straightforward that you will probably kick yourself when you see it.

In order to solve this particular problem, it is necessary to take a step back and look at the structure of your text in an abstract way. We need to find not only a pattern in the data that repeats (such as the number of spaces), but one that will allow us to insert a line break so that each name and its corresponding number will occur on the same line. Ideally, we would like to say "wherever there is a string of numbers, turn the next space into a linebreak". It is not possible to do this in one step in Notepad++ (although you could do it in a more powerful text editor, like Emacs). We are going to need 2 steps.

In order to find where each phone number ends, all we have to do is find the number that has a space after it.
Search for (regexp mode): ([0-9]) -note that there is a space after the closed parenthesis
Replace with: \1,

john cruz 00374653,kelly brunz 95847364,alan whirtz 9898372,jane doerl

The reason for inserting a comma is so that we will have something to search for in the next step when we want to insert a new line.

Search for (extended search mode): ,
Replace with: \r\n

which will give you this:
john cruz 00374653
kelly brunz 95847364
alan whirtz 9898372
jane doerl
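
The same "digit followed by a space" idea can be done in one pass in a scripting language. This is a minimal Python sketch, assuming (as above) that a number is always the last thing in each entry and that names never end in a digit.

```python
import re

line = "john cruz 00374653 kelly brunz 95847364 alan whirtz 9898372 jane doerl"

# A digit followed by a space marks the end of a phone number,
# so keep the digit (\1) and turn the space into a line break.
split_up = re.sub(r"(\d) ", r"\1\n", line)
print(split_up)
```

There is no need for the temporary comma here, because the replacement can contain the newline directly.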

Adrian981 said...

Thank you Mark, it works perfectly.
The reason I was asking you about how to make a new line after, let's say, 12 spaces/commas is that I have a lot of files that I want to break up into lines of 3, 6 and 9. I will try to give you a good example of what I'm looking to do, e.g.:
john,likes,this,jane,loves,games,peter,saved,me,george,fell,today,greg,pushed,me

I'm trying to divide them up into lines of 3.
john likes this
jane loves games
peter saved me
george fell today
greg pushed me

The whole long line is made up of 3 words that makes a small sentence.
Thanks for your previous help, you're amazing, and your help with this problem would be much appreciated.

Adrian

Mark Antoniou said...

So you start off with this
john,likes,this,jane,loves,games,peter,saved,me,george,fell,today,greg,pushed,me

and you want to put 3 words on each line. Notepad++ makes this a bit harder than it should be. We need 2 steps. First, add a comma to the end of the line so that it looks like this
john,likes,this,jane,loves,games,peter,saved,me,george,fell,today,greg,pushed,me,
If you have hundreds or thousands of lines, you could use a regular expression to do this. Anyway, back to the task at hand.

Search for (regexp mode): ([a-z]*),([a-z]*),([a-z]*),
Replace with: \1 \2 \3QQQ
john likes thisQQQjane loves gamesQQQpeter saved meQQQgeorge fell todayQQQgreg pushed meQQQ
Note that the QQQ is just a random string that I came up with which (a) will never occur in your list of words, and (b) is easily searchable, which is useful for the next step below.

Search for (extended search mode): QQQ
Replace with: \r\n
john likes this
jane loves games
peter saved me
george fell today
greg pushed me

Adrian981 said...

Hi Mark,

I think you've nearly cracked it for me, just a few more things I left out, not thinking they would be an issue.

Some of the words have numerals in them as in john25,left3,

And if I want to divide into lines of 5, how do I do that?

Thanks for the very quick response.

Adrian

Mark Antoniou said...

Ok, so let's say you start off with this:
john25,likes,this,jane11,loves,games,peter,saved,me,george,fell,left3,greg,pushed55,me,
and let's assume that you want to group them so that there are 5 words on each line.

Search for (regexp mode): ([a-z0-9]*),([a-z0-9]*),([a-z0-9]*),([a-z0-9]*),([a-z0-9]*),
Replace with: \1 \2 \3 \4 \5QQQ
which will give you this:
john25 likes this jane11 lovesQQQgames peter saved me georgeQQQfell left3 greg pushed55 meQQQ

Search for (extended search mode): QQQ
Replace with: \r\n
and there you go:
john25 likes this jane11 loves
games peter saved me george
fell left3 greg pushed55 me

Adrian981 said...

Brilliant. It works perfectly. Can you show me how to add to the code so I can make bigger lines? Let's say lines of 20.

Thank you for your great support.
Please let me know if I can give you a small donation through PayPal for your help.

Adrian.

Mark Antoniou said...

Glad you found the blog helpful, Adrian. In order to change the number of words that will end up on each line, simply change the number of ([a-z0-9]*), in the search term, and make sure you have the same number of items in the replacement term.

Using your example of 20, you would
Search for (regexp mode): ([a-z0-9]*),([a-z0-9]*),([a-z0-9]*),([a-z0-9]*),([a-z0-9]*),([a-z0-9]*),([a-z0-9]*),([a-z0-9]*),([a-z0-9]*),([a-z0-9]*),([a-z0-9]*),([a-z0-9]*),([a-z0-9]*),([a-z0-9]*),([a-z0-9]*),([a-z0-9]*),([a-z0-9]*),([a-z0-9]*),([a-z0-9]*),([a-z0-9]*),
Replace with: \1 \2 \3 \4 \5 \6 \7 \8 \9 \10 \11 \12 \13 \14 \15 \16 \17 \18 \19 \20QQQ

The obvious limitation of using this sort of brute force approach is that it becomes impractical if you wanted say 1000 words on each line (that would be a lot of copy+pasting!). But, we are trying to work around the limitations of Notepad++, so we have to (sometimes) use inelegant solutions.

As for donations, I gratefully and humbly accept whatever you can spare. My email address for Paypal is markbfm@yahoo.com

Adrian981 said...

Hi
I did a few tests and it only seems to divide up to 9 words per line; for any bigger line (10, 11, 12) it inserts a 0.

Any ideas
Adrian

Mark Antoniou said...

Ah, yes, you are right. Notepad++ will not let you have more than 9 bins (\10 is read as \1 followed by a literal 0). Sorry, I was not working in Notepad++ when I posted my previous reply. This is yet another reason to use a more powerful text editor for this sort of advanced regexp. Enough of my ranting.

So, let's say that you want to have more than 9 words per line. It's just a matter of making our bins bigger. Rather than putting one word in each bin, we could put 20 in each bin (or however many you like).

Ok, so we start off with these 40 words:
john25,likes,this,jane11,loves,games,peter,saved,me,george,fell,left3,greg,pushed55,me,john25,likes,this,jane11,loves,games,peter,saved,me,george,fell,left3,greg,pushed55,me,peter,saved,me,george,fell,left3,greg,pushed55,me,too,

Search for (regexp mode): ([a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,[a-z0-9]*,)
Note that there is only one open parenthesis at the start and one closed parenthesis at the end of the search term.
Replace with: \1QQQ

john25,likes,this,jane11,loves,games,peter,saved,me,george,fell,left3,greg,pushed55,me,john25,likes,this,jane11,loves,QQQgames,peter,saved,me,george,fell,left3,greg,pushed55,me,peter,saved,me,george,fell,left3,greg,pushed55,me,too,QQQ

Then use extended search mode to convert the QQQs to newlines (\r\n).
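
If the copy and paste gets out of hand, the number of words per line can become a parameter instead. The following is a rough Python sketch, not something Notepad++ can do by itself; the function name group_words is made up for illustration, and it joins the words with spaces rather than keeping the commas.

```python
def group_words(line: str, per_line: int) -> str:
    """Split a comma-separated line into rows of `per_line` words each."""
    # Drop empty strings so a trailing comma does not create an empty word.
    words = [w for w in line.strip().split(",") if w]
    rows = [" ".join(words[i:i + per_line]) for i in range(0, len(words), per_line)]
    return "\n".join(rows)


sample = "john25,likes,this,jane11,loves,games,peter,saved,me,george,fell,left3,greg,pushed55,me,"
grouped = group_words(sample, 5)
print(grouped)
```

Changing the 5 to 20 (or 1000) needs no change to the code, and a final row with fewer words is kept rather than left unmatched.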

Mark Antoniou said...

Oh, and you could just use a simple Find+Replace to replace the commas with spaces (if you want to).

Adrian981 said...

Thanks a lot Mark. It seems to leave ( at the start of each line and ) at the end. Should I just find and replace them, or am I still doing something wrong?

Mark Antoniou said...

No, it should not leave any ( or ) anywhere. Make sure that your search term and replace term are correct.

I have double-checked my post above and it is correct. No typos.

Adrian981 said...

Thanks a lot Mark, you're great. Is there any way to remove the , at the end of each line?

I've sent you a small donation for your great support.

Thanks again
Adrian

Adrian981 said...

Also, about this instruction:
Then use extended search mode to convert the QQQs to newlines (\r\n).

I tried (\r\n) first, and that's where I was getting the brackets, so I just used \r\n and it was perfect.

Just looking to remove the comma at the end of each line.

Mark Antoniou said...

Thank you for your support, Adrian.

Sorry if I confused you. I did not mean that the \r\n should be enclosed in parentheses in your replace term. Glad you figured that one out.

If you just want to remove the commas, you could do a simple Find+Replace:
Search for: ,
Replace with: nothing

Adrian981 said...

Yes, but doing that gets rid of all the commas. I just want to get rid of the commas at the end of each line.

Mark Antoniou said...

Ah, I see. You want to keep the other commas. Well in that case
Search for (regexp mode): (.*),
Replace with: \1
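
For comparison, here is how the same "last comma on the line only" replacement might look in Python. This is a small sketch that assumes plain \n line endings (which is what Python gives you when a file is read in text mode); the sample text is made up.

```python
import re

text = "john,likes,this,\njane,loves,games,"

# With re.MULTILINE, "$" matches at the end of every line, so ",$" only
# touches a comma that is the very last character on its line.
trimmed = re.sub(r",$", "", text, flags=re.MULTILINE)
print(trimmed)
```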

Adrian981 said...

Perfect thanks alot mark.
Adrian

Sam said...

I have a pretty small question.
Is there any way I can add something in front of every word? Like this:

b8nmuujs7jrug'
baszp4tj1s7vv'

add ' single quote in front of every word in the line.

thanks
Sam

Mark Antoniou said...

Ok, so you start off with this:
b8nmuujs7jrug'
baszp4tj1s7vv'

and you want to add a single quote ' to the beginning of each line.

Search for (regexp mode): (.*)
Replace with: '\1
which will give you this:
'b8nmuujs7jrug'
'baszp4tj1s7vv'

Sam said...

No worries I figured out the answer

Organix said...

Mark since you seem like the regex master maybe you can point me in the right direction: I have a csv file that has text enclosed in "" but the problem is that in the REMARK/detail field there can be inches which are also using " how can I find these lines with the extra quotations?

example of what I'm looking for:
TRTM_TYPE,TEST_TYPE,RUN_NO,TEST_NUMBER,TOP_DEPTH,BASE_DEPTH,REMARK
FRAC,IP,0,001,1441,1721,"DETAILS: AQUAFRAC 1000; 36750# 20/40 BROWN SD, 7500# 16/30 SIBERPROP"
FRAC,IP,0,001,11218,11346,""
FRAC,IP,0,001,8210,9250,"DETAILS: 60406 GALS WF GR8, 195564 GALS DF 200-R23"
FRAC,IP,0,001,9730,10030,"DETAILS: 51244 GALS WF GR8, 122796 GALS DF 200-R23"
FRAC,IP,0,001,10600,11050,"DETAILS: 27858 GALS WF GR8, 173466 GALS DF 200-R23"
FRAC,IP,0,001,11316,11582,"CMHPG 35#"
FRAC,IP,0,001,6714,7680,"DETAILS: 94 BBLS SLICK WTR, 95 BBLS SLICK WTR, 357 BBLS SLICK WTR, 119 BBLS LIGHTNING 2000 PAD, 0.5 TO 1 PPG 30/50# WHITE SD IN LIGHTNING 2000 GEL, WHITE SD"
FRAC,IP,0,001,7680,8190,"DETAILS: 87 BBLS SLICK WTR, 71 BBLS SLICK WTR, 357 BBLS SLICK WTR, 119 BBLS LIGHTNING 2000 PAD, 0.5 TO 1 PPG 30/50# WHITE SD IN LIGHTNING 2000 GEL, 238 BBLS SLICK WTR, DROP 3" BALL, 168 BBLS SLICK WTR, SEAT BALL, WHITE SD" <-Looking for these kind

Mark Antoniou said...

Thanks for your question, Organix. I normally get asked about changing a text file by restructuring data, but finding text in a particular format can be useful, too. You are interested in an expression that will find text that contains a third " which indicates that the comment includes the inches of some object or action, such as dropping a ball. To find these lines, use the search term below.

Search for (regexp mode): .*".*" .*"
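
If the file is too big to eyeball, counting the quotation marks on each line is another way to flag the same rows. This is a rough Python sketch under the assumption that a well-formed row has exactly two double quotes (one quoted REMARK field); the third sample row is an abridged version of the problem row above.

```python
lines = [
    'FRAC,IP,0,001,11218,11346,""',
    'FRAC,IP,0,001,11316,11582,"CMHPG 35#"',
    # Abridged from the example row that contains an inch mark.
    'FRAC,IP,0,001,7680,8190,"DETAILS: 238 BBLS SLICK WTR, DROP 3" BALL, WHITE SD"',
]

# Report the 1-based line numbers that contain more than two double quotes.
suspect = [number for number, line in enumerate(lines, start=1) if line.count('"') > 2]
print(suspect)
```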

Organix said...

Thanks Mark - I'm sure that was an easy one for you. I had tried ".*" .*" and ".*" .*"$ but was getting all the empty strings as well. Thanks again!

Vin said...

Hi Mark.

Just like to thank you for the extremely helpful article even for a newbie like me.

However, I can't seem to figure out how to solve this issue..

7-Jul-09;6-4-12(P:P7A-3A-12),JLN 4/125 ;VANTAGE POINT
3-Sep-09;8-8-7(P:P7B-8-7),JLN 4/125;VANTAGE POINT
1-Oct-09;6-10-07(P:P7A-10-7),JLN 4/125 ;VANTAGE POINT

So, I would like to get rid of everything within the brackets and keep everything else.

Also, is it possible to sort out the date accordingly?

Thank you so much for your time. Any advice would be much appreciated!

Mark Antoniou said...

Hey Vin,
Getting rid of the information within the parentheses is pretty easy, although getting Notepad++ to recognise that you are looking for a parenthesis as part of your search term requires you to precede it with a backslash: \( or \)

Search for (regexp mode): (.*)\(.*\)(.*)
Replace with: \1\2

So, you will end up with this.
7-Jul-09;6-4-12,JLN 4/125 ;VANTAGE POINT
3-Sep-09;8-8-7,JLN 4/125;VANTAGE POINT
1-Oct-09;6-10-07,JLN 4/125 ;VANTAGE POINT

I do not understand what you mean by "sort out the date accordingly". Throw me a bone here...

Vin said...

Hi Mark,

Great! That was exactly what I needed, just made my job a breeze! :)

I apologize for the vague question, what I meant to ask was, let's say I have thousands of data all with different dates, and I would like to sort them out from the earliest to latest.

Thank you so much Mark, your help is much appreciated!

Mark Antoniou said...

Oh, I get it now. You are not going to be able to do that in a text editor. Perhaps import the text file into Excel, use the text-to-columns feature and specify the semicolon as your delimiter, since that is what follows each date in your data. Column A will contain all of the dates. Select Column A, set the format of the cells to 'date'. Select the whole data range and sort ascending by column A. That will do it.
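
If Excel is not an option, a short script can do the sorting as well. This is only a sketch in Python, assuming that every line starts with a date like 7-Jul-09 followed by a semicolon, that the month abbreviations are English, and that the two-digit years belong to the 2000s; the helper name row_date is made up.

```python
from datetime import datetime

rows = [
    "3-Sep-09;8-8-7,JLN 4/125;VANTAGE POINT",
    "1-Oct-09;6-10-07,JLN 4/125 ;VANTAGE POINT",
    "7-Jul-09;6-4-12,JLN 4/125 ;VANTAGE POINT",
]


def row_date(row: str) -> datetime:
    # Everything before the first semicolon is the date, e.g. "7-Jul-09".
    return datetime.strptime(row.split(";", 1)[0], "%d-%b-%y")


sorted_rows = sorted(rows, key=row_date)
print("\n".join(sorted_rows))
```

Sorting on the parsed date rather than on the text matters, because as plain text "1-Oct-09" would sort before "7-Jul-09".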

Vin said...

Ok! Muchos Gracias. Can't begin to express my gratitude! :)

Manas said...

Thanks man. It was really helpful. I wanted to remove "," that comes in a string from a flat file of 200000+ records. The comma was messing up with my delimiter. BTW I used
[a-zA-Z1-0]+,[a-zA-Z1-0]+ as my search string..
Again thanks a ton man

Jeff said...

Hi,

I'm having trouble using regex in Notepad++ for the following case.

How do I find and append a row of hostname and IP address into one common statement with a newline added?

For example:

From
host101 192.168.0.1
host102 192.168.0.2
host103 192.168.0.3

To


As a result,




I really appreciate this very much if someone can shed some lights.

Cheers,
Jeffrey

Mark Antoniou said...

Glad you found the blog useful, Manas.

Jeff, I don't understand what you want to do. What do you want the text to look like at the end?

Jeff said...

Here is some of the input that was left out of my last comment.

To

(hostname host="ipaddress" port="11")(/hostname)

Expected results

(host101 host="192.168.0.1" port="11")(/host101)
(host102 host="192.168.0.2" port="11")(/host102)
(host103 host="192.168.0.3" port="11")(/host103)

Thanks again.

Mark Antoniou said...

Ok, got it. So you start off with this:
host101 192.168.0.1
host102 192.168.0.2
host103 192.168.0.3

Search for (regexp mode): (host.*) (.*)
Replace with: (\1 host="\2" port="11")(/\1)

which will give you this:
(host101 host="192.168.0.1" port="11")(/host101)
(host102 host="192.168.0.2" port="11")(/host102)
(host103 host="192.168.0.3" port="11")(/host103)
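
The same two-bin restructuring carries over almost unchanged to other regex engines. Here is a minimal Python sketch, assuming each line is exactly a hostname, one space, and an IP address; the round brackets stand in for the angle brackets, as in Jeff's comment.

```python
import re

text = "host101 192.168.0.1\nhost102 192.168.0.2\nhost103 192.168.0.3"

# \1 is the hostname and \2 is the IP address; \S means "any non-space character".
# re.MULTILINE makes ^ and $ apply to every line rather than the whole text.
result = re.sub(
    r"^(host\S*) (\S+)$",
    r'(\1 host="\2" port="11")(/\1)',
    text,
    flags=re.MULTILINE,
)
print(result)
```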

Jeff said...

Thanks for your great help, Mark. :-) I'm impressed to use this method to search and append a thousands row of command within few seconds.

This is a wonderful blog of yours that provide expert advice and solution that I'm exactly looking forward to revisit. Wish you do well in life and career.

Cheers,
Jeff

greg.fenton said...

Does N++ regex support doing an arithmetic calculation in the replacement?

For example, I have a serialized object such as:

{s:10:\"abcdefghij\"}

I want to replace "abcde" with "X", so not only do I need to make that change, but I also need to reduce the string size (10) by the replacement length difference (4).

So I'm looking for something like:

Search for:
s:\(\d+\):\\"abcde
Replace with:
s:$((\1 - 4)):"X

where $((\1 - 4)) is an arithmetic calculation whose result is injected in the replacement value.

Possible?

Thanks in advance.

Mark Antoniou said...

Greg, does N++ regex support doing an arithmetic calculation in the replacement?

No. However, depending on the arithmetic, there may be a way to "fake it", and bend Notepad++ to your will. One reader asked me if it is possible to increment numbers in the replace term. It isn't. But if you insert a number on each line using the Notepad++ column editor, then use regexp to restructure the data, the result is identical.

So, where does that leave us then? I am not 100% clear on what your text looks like or what you want it to look like. Like anything, there is a way to do it, but the question is how messy will it get, and is it the most efficient way of getting the job done. It all depends on how repetitious your replace term will be. My gut feeling is that you should probably take a look at either Perl or Awk for your particular case.
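
To give an idea of what this looks like once you leave the text editor, here is a rough sketch in Python (rather than Perl or Awk), where the replacement can be a function that does the arithmetic. It assumes the serialized string looks exactly like Greg's example, backslashes included, and the function name shrink is made up for illustration.

```python
import re

# Greg's example, with the backslashes kept as literal characters.
serialized = r'{s:10:\"abcdefghij\"}'

OLD, NEW = "abcde", "X"


def shrink(match: re.Match) -> str:
    # Reduce the stored length by the difference between old and new text.
    new_length = int(match.group(1)) - (len(OLD) - len(NEW))
    return f's:{new_length}:\\"{NEW}'


# \\" in the pattern matches a literal backslash followed by a double quote.
result = re.sub(r's:(\d+):\\"' + re.escape(OLD), shrink, serialized)
print(result)
```

The length is recomputed from the matched number each time, so the same function works whatever the original string size was.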

Bob C said...

Helped! Thanks :)

Mamoun J. said...

Your work is amazingly good. Recently a hacker injected iframes into all the PHP files of my web page. I'm trying to remove these iframes with Notepad++.
What I want to remove is something like:
IFRAME Bla Bla Bla /IFRAME
So, I know the beginning and the end of the string, but the problem is that the contents are not the same every time. One thing that I haven't confirmed yet is whether each iframe is on a separate line; if so, all I need is to delete the whole line wherever I find the iframe term.
Please give your suggestions. Many thanks.

Mark Antoniou said...

Thanks for your kind words, Mamoun. If all you need to do is remove whatever is contained within the Iframe tags, this can be achieved easily by

Search for (regexp mode): IFRAME.*/IFRAME
Replace with: nothing

If there are multiple instances of IFRAME on the same line, or if an individual IFRAME spans multiple lines, then things become a little more complicated, especially if you're using Notepad++. In this instance, you could change all instances of /IFRAME to something unique, such as ENDOFHACK or whatever. Then you could remove all newlines and replace them with something else unique, such as PUTBACKLATER or whatever. Then you would

Search for (regexp mode): IFRAME.*ENDOFHACK
Replace with: nothing

Then reinsert all newlines back where they were by replacing PUTBACKLATER with \r\n in extended search mode.

Either way, it can be done.
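
For the messy case (several IFRAMEs on one line, or one IFRAME spread over several lines), a scripting language can skip the placeholder juggling. This is a rough Python sketch that uses the same stand-in markers IFRAME and /IFRAME as the comments above; real files would have angle brackets around them, and the sample text is made up.

```python
import re

page = "keep1 IFRAME bad one /IFRAME keep2 IFRAME bad\ntwo /IFRAME keep3"

# ".*?" is non-greedy, so each match stops at the nearest /IFRAME instead of the last one.
# re.DOTALL lets "." match newlines, which covers an IFRAME that spans several lines.
cleaned = re.sub(r"IFRAME.*?/IFRAME", "", page, flags=re.DOTALL)
print(cleaned)
```

It is worth running something like this on a copy of the files first, since a pattern this broad removes everything between the two markers.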

catchthepanda said...

thanks for a good read, breaks down steps very well indeed for doing more complex regex stuff!

I leave you with

your regex power level is over 9000!!!

JPNL said...

Hello Mark, I am trying to replace spaces in string with a comma and figured I need to use a regular expression. I found your post and although it answers a lot, it doesn't help me achieve what I need. Hope you can help!

I have an export file in html from the delicious bookmark site. A part of the html looks like this:

PRIVATE="0" TAGS="marketing sales arrangementen workshops">Jump4art Workshops in Frankrijk

I need to replace the spaces in the 'TAGS' part to make it look like this

PRIVATE="0" TAGS="marketing,sales,arrangementen,workshops">Jump4art Workshops in Frankrijk

A search with RegEx (\TAGS=".*)(">) lets me find and select the entire 'TAGS' string, but I can't find how I can replace the spaces with a comma. Can you help please?

Thank you!
John-Pierre

Mark Antoniou said...

Hey JP,
Could you paste a few lines so that I can see the other bookmarks too. Could you paste, say about 5?

JPNL said...

Hi Mark, thanks for your quick reply! Here are a few lines. It's just a part of the full string because I can't paste the full html here. I don't see your email address, but if you have mine you can email me and I will reply with an actual example file. Thanks!

PRIVATE="0" TAGS="concerten tickets kaarten">Live Nation Live Nation Netherlands
PRIVATE="0" TAGS="winkelen nagerechten chocolade nougat hapje bonbon">FineFoodImports - Home
PRIVATE="0" TAGS="eigenbedrijf marketing sales,arrangementen workshops">Jump4art
PRIVATE="0" TAGS="ecofriendly bouwen frans">Tulkivi spreksteenkachel
PRIVATE="0" TAGS="hartigetaart vlees Recepten !Recepten">Kerriequiche

Mark Antoniou said...

Ok, thanks for providing the extra info, JP. So, we start off with what you have above. What makes things difficult is the fact that each bookmark has a different number of tags. So in order to get around this, first we will move everything after > to a new line.

Search for (extended search mode): >
Replace with: \r\n>

Which will give you this:

PRIVATE="0" TAGS="concerten tickets kaarten"
>Live Nation Live Nation Netherlands
PRIVATE="0" TAGS="winkelen nagerechten chocolade nougat hapje bonbon"
>FineFoodImports - Home
PRIVATE="0" TAGS="eigenbedrijf marketing sales,arrangementen workshops"
>Jump4art
PRIVATE="0" TAGS="ecofriendly bouwen frans"
>Tulkivi spreksteenkachel
PRIVATE="0" TAGS="hartigetaart vlees Recepten !Recepten"
>Kerriequiche

Now, let's replace all of the spaces within the double quotation marks with commas.

Search for (regexp mode): (PRIVATE="0" TAGS=".*) <--- note that there is a single space after the )
Replace with: \1,

Continue pressing Replace All until there are 0 occurrences that match this regular expression. You will end up with this:

PRIVATE="0" TAGS="concerten,tickets,kaarten"
>Live Nation Live Nation Netherlands
PRIVATE="0" TAGS="winkelen,nagerechten,chocolade,nougat,hapje,bonbon"
>FineFoodImports - Home
PRIVATE="0" TAGS="eigenbedrijf,marketing,sales,arrangementen,workshops"
>Jump4art
PRIVATE="0" TAGS="ecofriendly,bouwen,frans"
>Tulkivi spreksteenkachel
PRIVATE="0" TAGS="hartigetaart,vlees,Recepten,!Recepten"
>Kerriequiche

Now we put the 2 lines that we split up back together again.

Search for (extended search mode): "\r\n
Replace with: "

PRIVATE="0" TAGS="concerten,tickets,kaarten">Live Nation Live Nation Netherlands
PRIVATE="0" TAGS="winkelen,nagerechten,chocolade,nougat,hapje,bonbon">FineFoodImports - Home
PRIVATE="0" TAGS="eigenbedrijf,marketing,sales,arrangementen,workshops">Jump4art
PRIVATE="0" TAGS="ecofriendly,bouwen,frans">Tulkivi spreksteenkachel
PRIVATE="0" TAGS="hartigetaart,vlees,Recepten,!Recepten">Kerriequiche

And there you have it.
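
The split, repeat and rejoin routine is needed because Notepad++ cannot limit a replacement to one part of the line. In a scripting language the replacement can be a small function that only touches the text inside TAGS="...". This is a minimal Python sketch, assuming the tag list never contains a double quote; the function name commas_in_tags is made up for illustration.

```python
import re


def commas_in_tags(line: str) -> str:
    """Replace spaces with commas, but only inside the TAGS="..." value."""
    return re.sub(
        r'TAGS="([^"]*)"',
        lambda m: 'TAGS="' + m.group(1).replace(" ", ",") + '"',
        line,
    )


line = 'PRIVATE="0" TAGS="eigenbedrijf marketing sales,arrangementen workshops">Jump4art'
fixed = commas_in_tags(line)
print(fixed)
```

The spaces in the rest of the line (for example in a bookmark title) are left alone, so there is no need to move anything onto a separate line first.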

eagleapex said...

Just spent 10 minutes trying to make a clever USPTO search with your help.

unparseable (Too Many Search Terms 1043 ) ).

awww

JPNL said...

Wow that worked perfect! Thank you so much!!! and have a nice weekend.

Adrian981 said...

Hi Mark
You gave me a lot of help before with Notepad++. I was wondering if you could help me figure this out.

I'm looking to replace each comma at the end of a line with something else.

ie. jon,dan,paul,

I know how to find : (.*),

And i usually replace with : \1


I'm looking to find how to replace with a different word.
Any help is welcome.

Thanks
Adrian

Mark Antoniou said...

Hi Adrian,
Could you show me the before and after so that I know what you want it to look like at the end.

Adrian981 said...

Hey,

I figured it out; it was pretty simple after all.
\1,mark is what i replace with and it was correct.

Thanks

Helleye said...

Thanks for step 4.
It was very useful for me.

Frank said...

Thanks for this guide and (your even more helpful) answering of questions in the comments

Popsana said...

For such an sql statement as:
(10017, 'com_jublog', 'component', 'com_jublog', '', 1, 1, 0, 0, '{"legacy":false,"name":"com_jublog","type":"component","creationDate":"Mar 2012","author":"JoniJnm","copyright":"","authorEmail":"","authorUrl":"www.jonijnm.es","version":"1.0.1","description":"COM_JUBLOG_XML_DESCRIPTION","group":""}', '{"catid_blogs":"2","catid_pp":"2"}', '', '', 0, '0000-00-00 00:00:00', 0, 0),
(10019, 'themza_j15_14', 'template', 'themza_j15_14', '', 0, 1, 1, 0, '{"legacy":true,"name":"themza_j15_14","type":"template","creationDate":"2008-10-07","author":"Themza Team","copyright":"ThemZa 2008","authorEmail":"templates@themza.com","authorUrl":"http:\\/\\/www.themza.com","version":"1.0.0","description":"Feel the Music","group":""}', '{}', '', '', 0, '0000-00-00 00:00:00', 0, 0),

You Can use:

Regexp mode: ([(])([0-9],*)

Enes said...

Hi Mark,
i wonder that if i have a text like that :

Jessie 213
block me later
iamhere
Jack 232
blablabla
iamhere
blablabla
sometext
againsometext
Mark 30
where
iamhere
...

i want to output only ;
Jessie 213
Jack 232
Mark 30

And as you can see, there is a trick:
there is a fixed text (iamhere) two lines after each line that we need.
In fact the question is simple: can we select/mark the lines that come two lines before a fixed text?
I looked at TextFX but I couldn't solve the problem.
Thanks for the help.

RussiAmore said...

Thanks for such a great guide! There are lots of tips I didn't know about.

Unknown said...

Mark, I can see why you've received so much traffic on this post. It helped me solve cleaning up a very large xml document. Big Thanks for your Documentation and Examples!!!

Randy
Techie by day, woodworker by night...
http://www.custommade.com/by/repearson

Mark Antoniou said...

Glad you found it helpful, RussiAmore and Randy. And thanks for the kind words.

Enes, your problem is a simple one (in theory), but is made complicated by Notepad++. As you point out, there is a (somewhat) recurring pattern in the text. You want to keep the line that is two lines above "iamhere", discard the line above "iamhere" as well as the "iamhere" itself. I am not going to even bother doing this in Notepad++ because the solution will be very, very, very long. We need a text editor that will allow us to include newlines as part of our regexp search term. I recommend Emacs, available here: http://ftp.gnu.org/gnu/emacs/windows/

So, we start off with this:

Jessie 213
block me later
iamhere
Jack 232
blablabla
iamhere
blablabla
sometext
againsometext
Mark 30
where
iamhere

Search for: \(.*\)
.*
iamhere

Note: insert newline characters into a search term in Emacs by pressing Ctrl+Q Ctrl+J

Replace with: \1

This will give you this:

Jessie 213
Jack 232
blablabla
sometext
againsometext
Mark 30

Ok, so now you can see why I said that there is a *somewhat* recurring pattern. The number of lines between each occurrence of "iamhere" varies, so we want to get rid of "blablabla", "sometext" and "againsometext". In this example, we can use the fact that the unwanted text does not end with a number to our advantage, like this

Search for: .*[a-z]

Note: there is a newline after the [a-z]

Replace with: nothing - leave blank

And there you go:

Jessie 213
Jack 232
Mark 30
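
The rule "keep the line that sits two lines above iamhere" can also be written down directly in a few lines of script, without relying on whether the unwanted lines end in a letter. This is a rough Python sketch, assuming the marker line is exactly iamhere with nothing else on it.

```python
text = """Jessie 213
block me later
iamhere
Jack 232
blablabla
iamhere
blablabla
sometext
againsometext
Mark 30
where
iamhere"""

lines = text.splitlines()

# For every marker line, keep the line two positions above it.
kept = [lines[i - 2] for i, line in enumerate(lines) if line == "iamhere" and i >= 2]
print("\n".join(kept))
```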

Mscarfix said...

Hi Mark: I really appreciate your blog!
Question: I'm working in XML and I want to find all contents between these two tags: caution tags. (Imagine a left and right angle bracket around each caution, with verbiage between them. For some reason this blog won't allow angle bracket tags.)
I can find the tags, but how do I copy that content into a separate file? I know I have about 200 cautions and I want to extract only that content to a file. Make sense?

I would appreciate any assistance you can offer, oh Notepad++ guru you!

In gratitude,
Mscarfix

Mark Antoniou said...

Do you mean greater than and less than signs?

Could you paste a sample of the code (just a few lines).

Mark Antoniou said...

So, say that you start off with this

get rid of this.this is the stuff that I want to keep*don't want this
don't need this either.I want to keep this stuff too*the stuff here is crap

Search for (regular expression): .*\.(.*)\*.*
Replace with: \1

this is the stuff that I want to keep
I want to keep this stuff too

But, I am not sure what your code looks like, i.e., whether it has these tags <>.
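
If the goal is to pull the matching content out rather than to delete everything around it, collecting the matches into a list is one way to go. This is only a sketch in Python: the tag name caution comes from the description above, but the exact markup and the sample text are assumptions.

```python
import re

xml = (
    "<step>Do A</step><caution>Hot surface</caution>\n"
    "<step>Do B</step><caution>Wear\ngloves</caution>"
)

# Non-greedy capture between each opening and closing caution tag;
# re.DOTALL allows a caution to run over more than one line.
cautions = re.findall(r"<caution>(.*?)</caution>", xml, flags=re.DOTALL)
print(cautions)
```

The resulting list could then be written to a separate file, one caution per entry.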

Please note that this is the 200th comment (the most that can be shown on a single Blogger page). Please click the "Next" link below to see newer comments.
