20 February 2009

Rearrange text using multiline regular expressions in Emacs

My previous posts on Notepad++ and regular expressions have become very popular. In particular, readers have posed many questions along the lines of "how do I write a regular expression that will..." After having read this post you should be an intermediate regular expression creator. Let's get started.

While I am a fan of Notepad++, it is not powerful enough to perform the regular expressions that I will be going through in this post. I strongly recommend that you download and use the excellent (and free) XEmacs. Installing XEmacs is not as simple as it could be. To simplify the process, I have made my Config files available for download here. Instructions are included.

Regular expression you say, what exactly is a regular expression?
Think of a regular expression as a fancy Find+Replace command that can span across multiple lines and can rearrange bits of text. Using regular expressions involves two steps:
1. Create a search term that finds and selects only the text that you want to modify/move/delete.
2. Create a replace term that outputs the desired text in the correct way.
This is a bit vague, isn't it? Let's look at our first example.

Example 1. Manipulating a simple list of names using Find+Replace
I got married on 11 October (a few months ago) and the photographer gave us a DVD containing all of the photos, approximately 1350 JPEG files. My wife and I were told to choose 120 of these photos to be printed for our wedding album. The DVD contained two folders named 'A' and 'B'. Folder A contained files named DSC_0001 through to DSC_1000, and folder B contained files named DSC_0001 through to DSC_0350. The reason for the existence of two folders was that they came from two memory cards. A thousand photos were taken on card A, which got full, and then a further 350 photos were taken on card B. Now, the problem is that we have 350 overlapping filenames. So we need to change the filenames in some way to make them unique and identifiable. This is relatively easy. Here is what we did.

First, we chose 120 photos. 105 came from folder A. I copied the filenames of the chosen photos from folder A and pasted them into a new text file. It looked like this:

DSC_0004.jpeg
DSC_0007.jpeg
DSC_0015.jpeg
(and so on....)
DSC_0997.jpeg

Before moving on to adding filenames from folder B, I performed a simple find and replace.

Find: DSC_
Replace with: DSCa_

This made the text look like this:

DSCa_0004.jpeg
DSCa_0007.jpeg
DSCa_0015.jpeg
(and so on....)
DSCa_0997.jpeg

We then pasted the filenames from folder B into the text file, below the names from folder A:

DSCa_0004.jpeg
DSCa_0007.jpeg
DSCa_0015.jpeg
(and so on....)
DSCa_0997.jpeg
DSC_0002.jpeg
DSC_0004.jpeg
DSC_0011.jpeg
DSC_0015.jpeg
(etc.)

I then changed the newly pasted names using a simple Find and Replace term:

Find: DSC_
Replace with: DSCb_

That made our text look like this:

DSCa_0004.jpeg
DSCa_0007.jpeg
DSCa_0015.jpeg
(and so on....)
DSCa_0997.jpeg
DSCb_0002.jpeg
DSCb_0004.jpeg
DSCb_0011.jpeg
DSCb_0015.jpeg
(etc.)

This made each file unique and identifiable. I also removed the .jpeg extension to minimise the possibility of confusion.

DSCa_0004
DSCa_0007
DSCa_0015
(and so on....)
DSCa_0997
DSCb_0002
DSCb_0004
DSCb_0011
DSCb_0015
(etc.)

Of course, I could have renamed the filenames in many different ways. For example, I could have changed DSC_0004.jpeg to A\DSC_004, or A_004, or FolderA_004, and so on. The name is not important. What is important is that the names are uniquely identifiable, meaning that no two names are alike. Now, our photographer will not be confused about which DSC_0004.jpeg we want (the one from folder A or the one from folder B). This is the end of our simple Find+Replace exercise. Let's have a go at a regular expression.

Example 2. A simple Regular Expression: Removing newlines and replacing them with commas
Let's cover some Regular Expression basics. A Regular Expression is used to search for text with a similar pattern, that is, text that matches the search criteria. What makes Regular Expressions so powerful is that there isn't a one-to-one mapping between the regular expression and the text (as there would be in a simple Find+Replace). For example, a period or full stop character . matches any character. So searching for DSC. (note: that is DSC followed by a full stop) would select both DSCa and DSCb, as the full stop could be any character.
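As an illustration only, the behaviour of the full stop can be checked outside the editor. The sketch below uses Python's re module, which is an assumption of this example and not something the post itself relies on; the list of names is made up for the demonstration.

```python
import re

# Hypothetical sample names, chosen to show what the full stop accepts.
names = ["DSCa_0004", "DSCb_0002", "DSC_0015"]

# "DSC." means: the letters DSC followed by any one character.
pattern = re.compile(r"DSC.")

# Collect the text that the pattern matched at the start of each name.
matches = [pattern.match(name).group(0) for name in names]
print(matches)
```

Each name matches, because the fourth character can be anything: a, b, or even the underscore.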

Find+Replace is a useful tool. However, there are certain tasks that go beyond the capabilities of simple Find+Replace. Let's assume that we have the list of photos from above.

DSCa_0004
DSCa_0007
DSCa_0015
DSCa_0997
DSCb_0002
DSCb_0004
DSCb_0011
DSCb_0015

I don't want to leave this list as it is (one filename on each line). I want to place the files all on the same line, separated by a comma. This would be the regexp:

Find: newline
Replace with: ,

Note: You do not actually type in the word newline. When we type, every time the Return or Enter key is pressed, a newline character or carriage return is placed on the page (even if you cannot see it on the screen). Given that our filenames above occur on separate lines, there is a newline character after the final digit on each line. To further complicate matters, there are different types of newline characters. Most text editors have a hard time dealing with newline characters (it is precisely for this reason that I have switched from Notepad++ to XEmacs). XEmacs is quite excellent at handling them with minimum fuss. To enter a newline character into your Regular Expression, type Control+Q Control+J, written in XEmacs notation as C-q C-j.

Executing this regexp results in:

DSCa_0004,DSCa_0007,DSCa_0015,DSCa_0997,DSCb_0002,DSCb_0004,DSCb_0011,DSCb_0015
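For readers who want to try the same replacement in a script, here is a minimal sketch in Python (an assumption of this example; the post itself works in XEmacs). The sample string deliberately mixes three kinds of line ending, which is contrived, to show one way of coping with the "different types of newline characters" mentioned above.

```python
import re

# Four of the filenames, joined with three different line endings:
# \n (Unix), \r\n (Windows) and \r (old Mac). The mix is contrived.
names = "DSCa_0004\nDSCa_0007\r\nDSCa_0015\rDSCa_0997"

# Try \r\n first so that a Windows line ending becomes one comma, not two.
joined = re.sub(r"\r\n|\r|\n", ",", names)
print(joined)
```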

Example 3. A more complex example: Rearranging groups of text onto the same line
Let's assume that my wife went through the list of photos and wrote instructions to the photographer under each name, making the list look like this:

DSCa_0004
Print this one 6*4
DSCa_0007
Print this one 5*7
DSCa_0015
Print this on canvas
DSCa_0997
Print this 6*4
DSCb_0002
Print this one 8*10
DSCb_0004
Can you print this one in matte
DSCb_0011
Make 3 copies of this one
DSCb_0015
Print this one 6*4

I think that the best way to make this clear for the photographer would be to rearrange the text so that it looks like this:

DSCa_0004 = Print this one 6*4
DSCa_0007 = Print this one 5*7
(and so on....)

In order to make the correct regular expression, we must first identify the correct pattern in the text. We cannot simply replace all the newlines as we did in Example 2, because then every photo and comment would be on one very long line. We must place some unique characteristic into our regular expression that will discriminate between which two lines belong together and which do not. In this case, one possibility is by taking advantage of the fact that all photo filenames begin with D, the comments for that photo are on the line immediately below, and no comments begin with D. The regexp would look like this:

Find: \(D.*\) C-q C-j
Replace with: \1 =

Let's examine this regexp. D searches for the letter D. The full stop "." matches any single character except a newline. The asterisk after the full stop means repetition (zero or more of the preceding item), so D.* matches a D and everything after it up to the end of the line. The backslash-escaped parentheses \( and \) around D.* save the matched text so that we can reuse it in the replace term as \1. The line break, entered with the Ctrl+Q Ctrl+J keyboard command, is outside the parentheses, so it is matched but not saved, and is therefore discarded. Note that the search term is not tied to the start of a line, so it would also match a D appearing in the middle of a comment.

The replace term takes the saved D.* text (effectively the filename, which is enclosed in backslash+parentheses in the search term), adds a space, then an equals sign, then a space. This produces the following output:

DSCa_0004 = Print this one 6*4
DSCa_0007 = Print this one 5*7
DSCa_0015 = Print this on canvas
DSCa_0997 = Print this 6*4
DSCb_0002 = Print this one 8*10
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
DSCb_0015 = Print this one 6*4
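The same search and replace can be sketched in Python, purely as an illustration (Python is an assumption here, not part of the original workflow). Two differences in notation: Python writes groups as ( ) rather than \( \), and the newline is written \n instead of being typed with C-q C-j.

```python
import re

# The first three filename/comment pairs from the list above.
text = (
    "DSCa_0004\n"
    "Print this one 6*4\n"
    "DSCa_0007\n"
    "Print this one 5*7\n"
    "DSCa_0015\n"
    "Print this on canvas\n"
)

# (D.*) saves a line that contains a D; the \n after it is matched but not saved.
# The replacement puts the saved text back, followed by " = ".
result = re.sub(r"(D.*)\n", r"\1 = ", text)
print(result)
```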

Example 4. An even trickier example: Rearranging groups of text onto the same line 2
Let's assume that my wife wrote instructions to the photographer under each name, however, some comments began with a D. The list looks like this:

DSCa_0004
Do you think you could print this one 6*4
DSCa_0007
Print this one 5*7
DSCa_0015
Do this on canvas
DSCa_0997
Dave, print this 6*4
DSCb_0002
Print this one 8*10
DSCb_0004
Can you print this one in matte
DSCb_0011
Make 3 copies of this one
DSCb_0015
Print this one 6*4

This time, we cannot use the previous regexp as it does not identify only the photo filenames. It will also select the comments beginning with D. Using the regexp from Example 3 in this case will result in this:

DSCa_0004 = Do you think you could print this one 6*4 = DSCa_0007 = Print this one 5*7
DSCa_0015 = Do this on canvas = DSCa_0997 = Dave, print this 6*4 = DSCb_0002 = Print this one 8*10
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
DSCb_0015 = Print this one 6*4

It is a mess. We need to find a pattern in the text that uniquely identifies the filenames only and not the comments. Look at the filenames. All begin with DSC. So, we could change the regexp to find DSC or even DS. No comments begin with DS. Let's give it a shot.

Find: \(DS.*\) C-q C-j
Replace with: \1 =

And here's the output:

DSCa_0004 = Do you think you could print this one 6*4
DSCa_0007 = Print this one 5*7
DSCa_0015 = Do this on canvas
DSCa_0997 = Dave, print this 6*4
DSCb_0002 = Print this one 8*10
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
DSCb_0015 = Print this one 6*4

Easy. We could have used \(DSC.*\) as the search term and the result would have been the same.
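To see the difference between the two search terms side by side, here is a small Python sketch (Python is an assumption of this example, and groups are written ( ) instead of \( \)). The third variant, anchored to the start of a line with ^, is an extra suggestion that does not appear in the examples above.

```python
import re

# The first two filename/comment pairs from Example 4.
text = (
    "DSCa_0004\n"
    "Do you think you could print this one 6*4\n"
    "DSCa_0007\n"
    "Print this one 5*7\n"
)

# Example 3's term: also matches the comment that begins with D.
too_loose = re.sub(r"(D.*)\n", r"\1 = ", text)

# Example 4's term: only the filenames contain "DS".
with_ds = re.sub(r"(DS.*)\n", r"\1 = ", text)

# Stricter still (an addition, not from the post): DSC must start the line.
anchored = re.sub(r"^(DSC.*)\n", r"\1 = ", text, flags=re.MULTILINE)

print(too_loose)
print(with_ds)
```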

Example 5. Getting rid of unwanted information.
Let's assume that the photographer receives our list of photos. He wants to see which photos we want to print in what sizes, but he also wants to leave the descriptions for files where no size has been specified. Using regexp, we can remove all info other than the sizes. Here's the file:

DSCa_0004 = Do you think you could print this one 6*4
DSCa_0007 = Print this one 5*7
DSCa_0015 = Do this on canvas
DSCa_0997 = Dave, print this 6*4
DSCb_0002 = Print this one 8*10
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
DSCb_0015 = Print this one 6*4

The unique thing that identifies the comments that contain sizes is the asterisk character in the dimensions. We want to keep the asterisk as well as the characters immediately before and after the asterisk (the numbers). And here is the regexp:

Find: \(= \).*\(.\*.*\)
Replace with: \1\2

This is the output:

DSCa_0004 = 6*4
DSCa_0007 = 5*7
DSCa_0015 = Do this on canvas
DSCa_0997 = 6*4
DSCb_0002 = 8*10
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
DSCb_0015 = 6*4

As you can see, only lines containing sizes have had their comments removed. Let's go through the search term. The \(= \) searches for the equals sign followed by a space (which denotes where comments begin) and stores it in \1. The text that follows, matched by .*, is not kept, until we reach the pattern \(.\*.*\), which is stored in \2. Let's unpack this final part of the search, \(.\*.*\). The first full stop matches any character, the \* matches a literal asterisk (note: the backslash before the asterisk is essential, as it specifies that we are searching for an actual * character rather than using * as the repetition operator), and the final .* matches any characters that come after the asterisk.
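A quick way to experiment with this search term is a short Python sketch (an assumption of this example; the post uses Emacs). Python writes the groups as ( ) and, as in Emacs, \* means a literal asterisk.

```python
import re

# Three lines from the list above: two with sizes, one without.
text = (
    "DSCa_0004 = Do you think you could print this one 6*4\n"
    "DSCa_0015 = Do this on canvas\n"
    "DSCb_0002 = Print this one 8*10"
)

# Group 1 keeps "= ", the unsaved .* swallows the wording,
# and group 2 keeps one character, the asterisk, and whatever follows it.
result = re.sub(r"(= ).*(.\*.*)", r"\1\2", text)
print(result)
```

The line without an asterisk does not match at all, so it is left exactly as it was.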

Example 6: Rearranging information.
Our photographer has decided that he would like to display the sizes of our photos on the left and the filenames on the right of the equals sign. Photos containing comments only should remain as they are. Here is the text:

DSCa_0004 = 6*4
DSCa_0007 = 5*7
DSCa_0015 = Do this on canvas
DSCa_0997 = 6*4
DSCb_0002 = 8*10
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
DSCb_0015 = 6*4

Again the asterisk is our unique identifier of the sizes. Here is the regexp:

Find: \(.*\)\(= \)\(.\*.*\)
Replace with: \3 \2\1

And here is the output:

6*4 = DSCa_0004
5*7 = DSCa_0007
DSCa_0015 = Do this on canvas
6*4 = DSCa_0997
8*10 = DSCb_0002
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
6*4 = DSCb_0015

Our photographer could have done this at the beginning of Example 5 (if he wanted to). Here is the text from Example 5:

DSCa_0004 = Do you think you could print this one 6*4
DSCa_0007 = Print this one 5*7
DSCa_0015 = Do this on canvas
DSCa_0997 = Dave, print this 6*4
DSCb_0002 = Print this one 8*10
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
DSCb_0015 = Print this one 6*4

And here is the regular expression that will (a) get rid of all info apart from sizes, and (b) rearrange the order of the text to size = filename:

Find: \(.*\)\(= \).*\(.\*.*\)
Replace with: \3 \2\1

And here is the output:

6*4 = DSCa_0004
5*7 = DSCa_0007
DSCa_0015 = Do this on canvas
6*4 = DSCa_0997
8*10 = DSCb_0002
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
6*4 = DSCb_0015

As you can see, the two outputs are identical.
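The combined search term can also be tried in Python, as a sketch only (Python is an assumption of this example). One detail shows up here: the first group ends with the space that sits before the equals sign, so the rearranged lines carry a trailing space, which the sketch strips afterwards.

```python
import re

# Three lines from the Example 5 text: two with sizes, one without.
text = (
    "DSCa_0004 = Do you think you could print this one 6*4\n"
    "DSCa_0015 = Do this on canvas\n"
    "DSCb_0002 = Print this one 8*10"
)

# Group 1: filename (plus a trailing space), group 2: "= ", group 3: the size.
swapped = re.sub(r"(.*)(= ).*(.\*.*)", r"\3 \2\1", text)

# Remove the trailing space left over from group 1.
result_lines = [line.rstrip() for line in swapped.splitlines()]
print("\n".join(result_lines))
```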

These are only a few examples of how regular expressions can be used. I hope that these examples will be useful for you and will provide some guidance and stimulation about what is possible with regexp. If you have any questions, please post them in the comments below.

22 comments:

xxx said...

How can i do in notepad++ something like this:

i have:

IF xx is e and....
IF zx is w and....
empty line
IF xt is f and....
IF nx is b and....
empty line
empty line
IF xt is f and....
IF nx is b and....


and i would like to have:

R1: IF xx is e and....
R2: IF zx is w and....
empty line
R3: IF xt is f and....
R4: IF nx is b and....
empty line
empty line
R5: IF xt is c and....
R6: IF nx is y and....

thx for helping me...

Mark Antoniou said...

Notepad++ cannot handle this regular expression in one step. You can achieve what you want, but it will take a few steps. So, we start off with this:

IF xx is e and....
IF zx is w and....

IF xt is f and....
IF nx is b and....


IF xt is f and....
IF nx is b and....

Step 1: Get rid of the empty lines
Search (Extended) for: \r\n\r\n
Replace with \r\n

Note that if you have two empty lines in a row, you would add another \r\n and search for \r\n\r\n\r\n, and so on.
So, now we have this:

IF xx is e and....
IF zx is w and....
IF xt is f and....
IF nx is b and....
IF xt is f and....
IF nx is b and....

Step 2: Add incremented numbers to the front of each line.
Click Edit | Column Editor (shortcut Alt+C)
and then click "Number to insert" and set the value of "Initial number" to "1" and "Increase by" to "1", and then click Ok.
So, now we have this:

1IF xx is e and....
2IF zx is w and....
3IF xt is f and....
4IF nx is b and....
5IF xt is f and....
6IF nx is b and....

Step 3: Use regular expression to add R and : and correct spacing.
Search (Regular expression) for: (.)IF
Replace with: R\1: IF
Note that (.) captures only a single character, so this works as written for up to nine lines. For longer lists, search for ([0-9]+)IF instead.
And you end up with your desired outcome (except that the empty lines are gone):

R1: IF xx is e and....
R2: IF zx is w and....
R3: IF xt is f and....
R4: IF nx is b and....
R5: IF xt is f and....
R6: IF nx is b and....

With Emacs, I could have given you a single step solution, but that's the price you pay for using Notepad++ ;)
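For anyone happy to step outside the editor, the numbering can also be done in one pass with a short script. The sketch below is in Python, which is an assumption of this example and is not the Emacs solution alluded to above. Unlike the three-step recipe, it leaves the empty lines in place.

```python
import re
from itertools import count

text = (
    "IF xx is e and....\n"
    "IF zx is w and....\n"
    "\n"
    "IF xt is f and....\n"
    "IF nx is b and....\n"
    "\n"
    "\n"
    "IF xt is f and....\n"
    "IF nx is b and....\n"
)

counter = count(1)  # yields 1, 2, 3, ...

# For every "IF" at the start of a line, insert the next rule number.
numbered = re.sub(
    r"^IF",
    lambda match: f"R{next(counter)}: IF",
    text,
    flags=re.MULTILINE,
)
print(numbered)
```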

Kusin Knase said...

Could you hint on why "Most text editors have a hard time dealing with newline characters"? I find this deficiency very surprising. A text document is, from the viewpoint of the editor, just a string of characters, and newline is just another character (or two, depending on the platform). Why would this be so troublesome? I don't get it.

Mark Antoniou said...

@Kusin Knase
This is going a little bit beyond my expertise, but I believe that part of the problem is that over time and across operating systems, a variety of characters have been used for newline (or line break, or line return, etc.). For a discussion, see http://en.wikipedia.org/wiki/Newline, particularly under the "Unicode" section of that Wikipedia entry.

So, I guess that a text editor such as Emacs is better at handling newlines because it recognises more of the newline characters than Notepad++ does. This still does not account for why Notepad++ requires an extended search mode - my guess is that to do away with extended search mode would require a significant rewrite of the source code, and the programmers do not consider it to be a priority.

Sahara Kite said...

Hi Mark,
thanks for the post it is very very useful, however I can't find my way on doing this:

I have many lines like these:

1060010000011103011615000380012000380012226600036300000000000000000000036303270327000000036000000001244000790722
48001571000180662
1060010000021103012615000065009000065009555500065400000000000000000000065400180636000000636000012701244000808451
48001710000010661
1060010000031103011115000150010000150010336600047200000000000000000000047204720472000000000000006001110001209665
48001431110040662
1060010000041103011615000167510000167510226600064100000000000000000000064105560556008500000000009351110001594432
48001361134038663

what I need (1st) is that the line starting with 4 has to follow the line starting with 1, like this:

10600100... 48001571000180662

but I need an identifier before the 4 to know where it starts.
Then I need (2nd) to add something(eg. a comma) at determinate places (this is census data so the first position means level1; the position 2-3 the region; the positions 4-5-6 the city; etc.) so as to be able to differentiate the variables in columns when importing them to Excel.

I have figured out how to do the 1st part as it is similar to one of your examples, but I cannot find how to add commas at determinate positions.

Note: I use Notepad++ and tried UltraEdit (the free version for 30 days) as I couldn't manage to run your examples in Emacs (it just doesn't work or I don't know how to make it work... the program starts OK but when I type your regex expressions with your data it says "couldn't find ..."). For the 1st part I have "translated" your regex expressions to UltraEdit ones and it worked (although after a lot of trial and error).

Many thanks in advance and congrats for your blog.

Nico

Sahara Kite said...

I forgot to mention: I am working on a Windows 7 machine.

Mark Antoniou said...

Ok, so it seems that you want to do quite a few manipulations. I am not sure if I have understood them all, but will do my best. First thing is to get the number beginning with 1 and the number beginning with 4 on the same line. I have chosen to use a comma as the character between the two large numbers. I did this in Emacs using the Replace Regexp command, although you could probably do it in UltraEdit if you find it easier. Note that the newline in the search term is entered by pressing Control+Q Control+J.

Search for: \(1.*\)
\(4.*\)
Replace with: \1,\2
This will give you this

1060010000011103011615000380012000380012226600036300000000000000000000036303270327000000036000000001244000790722,48001571000180662
1060010000021103012615000065009000065009555500065400000000000000000000065400180636000000636000012701244000808451,48001710000010661
1060010000031103011115000150010000150010336600047200000000000000000000047204720472000000000000006001110001209665,48001431110040662
1060010000041103011615000167510000167510226600064100000000000000000000064105560556008500000000009351110001594432,48001361134038663

Re: adding commas at specific positions, you have not provided enough information for me to provide you with a foolproof solution. I have assumed that the commas are to be inserted in the first number, which begins with a 1. If the number of characters is always the same, you could just specify the number of characters using the period regexp "." which matches any character. So if digit 1 is the level, you could use one period to match this, if 2-3 are the region you could use two periods, and if 4-6 are the city you could use three periods. And if this pattern is identical for all lines, you could use something like the following regexp

Search for: \(.\)\(..\)\(...\)\(.*\)
Replace with: \1,\2,\3,\4
which will give you this. Note that the commas have been inserted at the start of each line.

1,06,001,0000011103011615000380012000380012226600036300000000000000000000036303270327000000036000000001244000790722,48001571000180662
1,06,001,0000021103012615000065009000065009555500065400000000000000000000065400180636000000636000012701244000808451,48001710000010661
1,06,001,0000031103011115000150010000150010336600047200000000000000000000047204720472000000000000006001110001209665,48001431110040662
1,06,001,0000041103011615000167510000167510226600064100000000000000000000064105560556008500000000009351110001594432,48001361134038663

For each line, 1 is the level, 06 is the region, 001 is the city and so forth. Then, at the end of the line is another comma with the number beginning with 4 that was originally on the line below.

If this is correct, then great. If it is not correct, the best way to help me help you would be to show me exactly what the text should look like once you are done.
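As a rough illustration of the fixed-width idea, here is the same search term in Python (an assumption of this example; groups are written ( ) instead of \( \)). The two records are shortened, made-up stand-ins for the long census lines.

```python
import re

# Shortened, hypothetical records: level (1), region (2), city (3), then the rest.
records = "106001000001,48001\n106001000002,48002"

# One period per character: 1 for level, 2 for region, 3 for city.
# The final group takes everything else on the line.
result = re.sub(
    r"^(.)(..)(...)(.*)$",
    r"\1,\2,\3,\4",
    records,
    flags=re.MULTILINE,
)
print(result)
```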

Sahara Kite said...

Hi Mark, thank you so much for your extremely quick response. Yes you got it right, what I need in the end is this:

1,06,001,000001,11,03,0,1,1,6,1,5,0003800,12,0003800,12,2,2,6,6,000363,000000,000000,000000,000363,0327,0327,0000,00036,00000000,1244,000790722,4,800,1,57,1,000,180,6,6,2

so that it can be easily imported to Excel, each comma used to separate columns.

The expression for that would be the same as you wrote but adding up to 32 \(\); i.e. \(.\)\(..\)\(...\)\(......\)\(..\)\(..\)\(.\)\(.\)\(.\)\(.\)\(.\)\(.\)\(.......\)\(..\)\(.......\)\(..\)\(.\)\(.\)\(.\)\(.\)\(......\)\(......\)\(......\)\(......\)\(......\)\(....\)\(....\)\(....\)\(....\)\(........\)\(....\)\(.........\)\(.*\)

And then the replace:
\1,\2,\3,\4,\5,\6,\7,\8,\9,\10,\11,\12,\13,\14,\15,\16,\17,\18,\19,\20,\21,\22,\23,\24,\25,\26,\27,\28,\29,\30,\31,\32

Is that correct? Can Emacs handle so many \(\)?

The reason of using UltraEdit instead of Emacs is just because I didn't manage to find how to do that in Emacs. When I write down the expressions (at the bottom of the Emacs window) it keeps saying that nothing is found... I am probably doing something wrong.
Did you finally post the 'Guide to using regular expressions with Emacs' in your blog? I was unable to find it. I wish I could use Emacs as the regular expressions are more straight forward than UE ones, plus UE help says: "A regular expression may have up to 9 tagged expressions" so I cannot perform the 32 tagged expressions mentioned before.

Just in case, do you know of a website where one can learn how to use Emacs? Maybe you know of an alternate software (more user friendly than Emacs, like Notepad++) which can use the Emacs regular expressions.
I guess it's a stupid question but, can I perform what I need in Notepad++? (from what I have read in your blog, the response is no, or at least not in one unique expression).

Thanks again for your time. You are helping me a lot!

Nico

Mark Antoniou said...

This *is* my post on how to use Emacs! Haha. I suggest starting with simple expressions and building your way up. You could even split up your 42 (not 32) element expression in Notepad++ if you wanted to. You could break this down into multiple steps where you do buffers 1 to 9 in one regexp, then 10 to 18 in another and so on. You could insert a strange character at the end of each group of 9 buffers, such as X, and use that to pick up where you left off, i.e., (.*)X(.whatever...

That would be the way to do it in Notepad++.
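When the number of fields grows this large, another option is to leave regular expressions aside and cut each line by a list of column widths. The sketch below is in Python, which is an assumption of this example; the function name split_fixed_width is hypothetical, and only the first six widths and the first 22 characters of the first record are used, so the real width list would need to be filled in by the reader.

```python
def split_fixed_width(record, widths):
    """Cut record into fields of the given widths and join them with commas.

    Whatever is left after the last width is kept as a final field.
    """
    fields = []
    position = 0
    for width in widths:
        fields.append(record[position:position + width])
        position += width
    fields.append(record[position:])  # the remainder of the line
    return ",".join(fields)


# First 22 characters of the first census line, and the first six widths
# (level, region, city, and the three fields after them).
sample = "1060010000011103011615"
widths = [1, 2, 3, 6, 2, 2]

print(split_fixed_width(sample, widths))
```

Because the widths live in an ordinary list, there is no limit of nine groups to work around.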

Sahara Kite said...

Hi again,
I am struggling with this problem:

I have this:
105001000001080113061400235001
27810000000100
27830000000100
105001000002080111161500001200
27810000000100
27830000000100
105001000003080111161500008600
27810000000100
27830000000100
105001000004080111161500009600
27810000000100
105001000005080111161500015001
27810000000100
27830000000100
105001000006080111161500024201
27510000000100
27630000000100
27830000000100
27850000000100
105001000007080111161400022001
27810000000100
27830000000100
105001000008080111161400020301
27810000000100
27830000000100
27850000000100
105001000009080111161500008700
27810000000100
27830000000100
105001000010080111161700001300
105001000011080111161500060001
105001000011080111161500060001
27530000000100
27830000000100
27850000000100
27900000000100
etc

And I need to have this:

105001000001080113061400235001x27810000000100x27830000000100
105001000002080111161500001200x27810000000100x27830000000100
105001000003080111161500008600x27810000000100x27830000000100
105001000004080111161500009600x27810000000100
105001000005080111161500015001x27810000000100x27830000000100
105001000006080111161500024201x27510000000100x27630000000100x27830000000100x27850000000100
105001000007080111161400022001x27810000000100x27830000000100
105001000008080111161400020301x27810000000100x27830000000100x27850000000100
105001000009080111161500008700x27810000000100x27830000000100
105001000010080111161700001300
105001000011080111161500060001
105001000011080111161500060001x27530000000100x27830000000100x27850000000100x27900000000100

This is census data, the row 1 (starting with 1) has locality ID and the rows starting with 2 agricultural data. To know to which locality the agricultural data is from I need to have agricultural rows following the locality ID.
As you can see the problem is that there can be from none up to 7 (not shown in this example) rows starting with 2.

I have tried many times but don't get how to do that with notepad ++ regex.

Thanks again for helping me.

Mark Antoniou said...

All that you want to find is whenever a 2 occurs after a newline, and then replace that newline with an x. In Notepad++, you search for newlines in Extended Search mode using \r\n. So, in order to find all of the 2s that occur after a newline, you would

Search for (Extended Search mode): \r\n2
Replace with: x2

which will give you this:

105001000001080113061400235001x27810000000100x27830000000100
105001000002080111161500001200x27810000000100x27830000000100
105001000003080111161500008600x27810000000100x27830000000100
105001000004080111161500009600x27810000000100
105001000005080111161500015001x27810000000100x27830000000100
105001000006080111161500024201x27510000000100x27630000000100x27830000000100x27850000000100
105001000007080111161400022001x27810000000100x27830000000100
105001000008080111161400020301x27810000000100x27830000000100x27850000000100
105001000009080111161500008700x27810000000100x27830000000100
105001000010080111161700001300
105001000011080111161500060001
105001000011080111161500060001x27530000000100x27830000000100x27850000000100x27900000000100

Pretty easy, huh!
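The same replacement can be sketched in Python for anyone who prefers a script (Python is an assumption of this example). The pattern \r?\n accepts both Windows and Unix line endings, and the sample uses only a few of the lines from the question.

```python
import re

# A few of the census lines, with Windows line endings as in Notepad++.
text = (
    "105001000004080111161500009600\r\n"
    "27810000000100\r\n"
    "105001000005080111161500015001\r\n"
    "27810000000100\r\n"
    "27830000000100\r\n"
    "105001000010080111161700001300\r\n"
)

# A line break followed by a 2 becomes "x2"; other line breaks are untouched.
result = re.sub(r"\r?\n2", "x2", text)
print(result)
```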

Kheirul Nazib said...

hi - i'm new with notepad++ - need some REPLACE help...

say i have sets of numbers ID

a1234
b1234
c1234
d1234

and

a1234=aaaa
b1234=bbbb
c1234=cccc
d1234=dddd

question - how to replace all using command

thank you

Mark Antoniou said...

Hi Kheirul,

If you start off with

a1234
b1234
c1234
d1234

and you want to add an equals sign and the first character four times, you would

Search for (regular expression mode): (.)(.*)
Replace with: \1\2=\1\1\1\1

and that will give you this:

a1234=aaaa
b1234=bbbb
c1234=cccc
d1234=dddd

In Emacs, it would be the same, except that the search term would be

Search for: \(.\)\(.*\)
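For comparison, a minimal Python sketch of the same idea (Python is an assumption of this example). Its syntax is close to the Notepad++ form: plain parentheses for groups and \1, \2 in the replacement.

```python
import re

ids = "a1234\nb1234\nc1234\nd1234"

# Group 1 is the first character, group 2 is the rest of the line.
# The replacement repeats group 1 four times after the equals sign.
result = re.sub(r"(.)(.*)", r"\1\2=\1\1\1\1", ids)
print(result)
```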

Simon said...

Hi Marc,
thanks for this guide!
Perhaps you can help me with this:

I wish to find and replace lines containing the folder name 'Catchments' and the subfolder name 'Current', e.g.

Catchments\Baro North\Current
to
Catchments\Baro North\2020

or

Catchments\Dider\Current
to
Catchments\Dider\2020

(there are other lines in my file containing either 'Catchments' or 'Current', but these should be left out)

Cheers,
Simon

Mark Antoniou said...

Hi Simon,
This is a relatively straightforward regular expression. You haven't specified which text editor you are using, so I will assume it is Emacs.

Search for: \(Catchments.*\)Current
Replace with: \12020
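A note for anyone porting this to another tool: a replacement such as \12020 can be ambiguous, because some engines may read the digits after the backslash as one long group number. The Python sketch below (Python is an assumption of this example) avoids that by writing the group reference as \g<1>. The last two sample lines are made up to show that lines containing only one of the two words are left alone.

```python
import re

text = (
    "Catchments\\Baro North\\Current\n"
    "Catchments\\Dider\\Current\n"
    "Other\\Current\n"
    "Catchments\\Dider\\Archive\n"
)

# Group 1 keeps everything from "Catchments" up to "Current";
# \g<1> puts it back and "2020" takes the place of "Current".
result = re.sub(r"(Catchments.*)Current", r"\g<1>2020", text)
print(result)
```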

Unknown said...

I want to replace all the text between two words with a blank. Can you please help me with it.
The contents between the two words aren't always the same.
For eg:
A and B are the two words between which I want to replace all characters.
Line 1: "A abc hijk ... B" should become: AB
Line 2: "A pqr fcsdh .. B"
should become AB

Mark Antoniou said...

Here you go disha

Search for: \(A\).*\(B\)
Replace with: \1\2

Unknown said...

Hi Mr. Antoniou,

I just wanted to say thank you. I am doing some simple find/replace commands across multiple (25) .html files using Notepad++, but those find/replaces contain carriage returns and newlines so I needed to mess with regular expressions. Looking at your blog posts helped me to figure out how to do what was needed even though I am not a programmer. Thank you for breaking some of these concepts down such that a layman can understand and use them. You saved me a couple hours of tedious typing. Thanks again.

=)

-Tripleguess

Unknown said...

I have file in which address is coming in multiple line. for example

"9 Raffles Place, #20-20
Republic Plaza II
Singapore 048619 Singapore 048619 Singapore"

But they are always within quotes.
I need to try 2 things which ever is possible.
Either remove the new line whichever is present in quotes or remove the entire string and replace with blank or something a constant string.
There is possibility that there are multiple such addresses.

Could you please help me with the regular expression.

Mark Antoniou said...

Hi Amir,
I assume that you're using Emacs, in which case, this is very easy to take care of.

Search for: \(".*\)
\(.*\)
\(.*"\)

Replace with: \1 \2 \3

Which will give you this: "9 Raffles Place, #20-20 Republic Plaza II Singapore 048619 Singapore 048619 Singapore"
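The search term above handles an address that spans exactly three lines. If the number of lines varies, one alternative is a small script. The sketch below is in Python (an assumption of this example) and assumes that quoted text never contains another double quote; the surrounding "id," and ",ok" fields are invented for the demonstration.

```python
import re

# A made-up record with a quoted address spread over three lines.
text = (
    'id,"9 Raffles Place, #20-20\n'
    "Republic Plaza II\n"
    'Singapore 048619 Singapore 048619 Singapore",ok\n'
)


def flatten(match):
    """Replace the line breaks inside one quoted string with spaces."""
    return match.group(0).replace("\n", " ")


# "[^"]*" matches a quote, then anything that is not a quote
# (including line breaks), then the closing quote.
result = re.sub(r'"[^"]*"', flatten, text)
print(result)
```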

duceduc said...

Hello. I want to learn about regular expressions and have been using Notepad++ just for general usage at the moment. You mentioned XEmacs and have included your config files. I have read the instructions within the folder and have copied the two folders to my documents folder. What is the next step and how do I run it? Thanks.

Layarion said...

I just want to know how to use notepad++, not this other crap.

could you tell me what flavor NotePad++ uses so I may actually lookup this stuff?