20 February 2009

Rearrange text using multiline regular expressions in Emacs

My previous posts on Notepad++ and regular expressions have become very popular. In particular, readers have posed many questions along the lines of "how do I write a regular expression that will..." After having read this post you should be an intermediate regular expression creator. Let's get started.

While I am a fan of Notepad++, it is not powerful enough to perform the regular expressions that I will be going through in this post. I strongly recommend that you download and use the excellent (and free) XEmacs. Installing XEmacs is not as simple as it could be. To simplify the process, I have made my Config files available for download here. Instructions are included.

Regular expression you say, what exactly is a regular expression?
Think of a regular expression as a fancy Find+Replace command that can span across multiple lines and can rearrange bits of text. Using regular expressions involves two steps:
1. Create a search term that finds and selects only the text that you want to modify/move/delete.
2. Create a replace term that outputs the desired text in the correct way.
This is a bit vague, isn't it? Let's look at our first example.

Example 1. Manipulating a simple list of names using Find+Replace
I got married on 11 October (a few months ago) and the photographer gave us a DVD containing all of the photos, approximately 1350 JPEG files. My wife and I were told to choose 120 of these photos to be printed for our wedding album. The DVD contained two folders named 'A' and 'B'. Folder A contained files named DSC_0001 through to DSC_1000, and folder B contained files named DSC_0001 through to DSC_0350. The reason for the existence of two folders was that they came from two memory cards. A thousand photos were taken on card A, which got full, and then a further 350 photos were taken on card B. Now, the problem is that we have 350 overlapping filenames. So we need to change the filenames in some way to make them unique and identifiable. This is relatively easy. Here is what we did.

First, we chose 120 photos. 105 came from folder A. I copied the filenames of the chosen photos from folder A and pasted them into a new text file. It looked like this:

DSC_0004.jpeg
DSC_0007.jpeg
DSC_0015.jpeg
(and so on....)
DSC_0997.jpeg

Before moving on to adding filenames from folder B, I performed a simple find and replace.

Find: DSC_
Replace with: DSCa_

This made the text look like this:

DSCa_0004.jpeg
DSCa_0007.jpeg
DSCa_0015.jpeg
(and so on....)
DSCa_0997.jpeg

We then pasted the filenames from folder B into the text file, below the names from folder A:

DSCa_0004.jpeg
DSCa_0007.jpeg
DSCa_0015.jpeg
(and so on....)
DSCa_0997.jpeg
DSC_0002.jpeg
DSC_0004.jpeg
DSC_0011.jpeg
DSC_0015.jpeg
(etc.)

I then changed the newly pasted names using a simple Find and Replace term:

Find: DSC_
Replace with: DSCb_

That made our text look like this:

DSCa_0004.jpeg
DSCa_0007.jpeg
DSCa_0015.jpeg
(and so on....)
DSCa_0997.jpeg
DSCb_0002.jpeg
DSCb_0004.jpeg
DSCb_0011.jpeg
DSCb_0015.jpeg
(etc.)

This made each file unique and identifiable. I also removed the .jpeg extension to minimise the possibility of confusion.

DSCa_0004
DSCa_0007
DSCa_0015
(and so on....)
DSCa_0997
DSCb_0002
DSCb_0004
DSCb_0011
DSCb_0015
(etc.)
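
For readers who prefer a script to an editor, the same renaming can be sketched in Python. This is only an illustration: the helper name tag_names and the two short lists are made up for the example, and each folder's list is tagged separately instead of relying on the order of the Find+Replace steps.

```python
# Hypothetical helper: mark each filename with the folder it came from.
def tag_names(names, folder_letter):
    tagged = []
    for name in names:
        # Plain substring replacement, no regular expression involved.
        name = name.replace("DSC_", "DSC" + folder_letter + "_")
        # Drop the extension, as in the final list above.
        name = name.replace(".jpeg", "")
        tagged.append(name)
    return tagged

# Sample data only; the real lists were much longer.
folder_a = ["DSC_0004.jpeg", "DSC_0007.jpeg", "DSC_0015.jpeg", "DSC_0997.jpeg"]
folder_b = ["DSC_0002.jpeg", "DSC_0004.jpeg", "DSC_0011.jpeg", "DSC_0015.jpeg"]

all_names = tag_names(folder_a, "a") + tag_names(folder_b, "b")
print("\n".join(all_names))
```

Because every name now carries its folder letter, DSC_0004 from folder A and DSC_0004 from folder B no longer collide.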

Of course, I could have renamed the filenames in many different ways. For example, I could have changed DSC_0004.jpeg to A\DSC_004, or A_004, or FolderA_004, and so on. The name is not important. What is important is that the names are uniquely identifiable, meaning that no two names are alike. Now our photographer will not be confused about which DSC_0004.jpeg we want (the one from folder A or the one from folder B). This is the end of our simple Find+Replace exercise. Let's have a go at a regular expression.

Example 2. A simple Regular Expression: Removing newlines and replacing them with commas
Let's cover some Regular Expression basics. A Regular Expression is used to search for text with a similar pattern, that is, text that matches the search criteria. What makes Regular Expressions so powerful is that there isn't a one-to-one mapping between the regular expression and the text (as there would be in a simple Find+Replace). For example, a period or full stop character . matches any single character. So searching for DSC. (note: that is DSC followed by a full stop) would select both DSCa and DSCb, as the full stop could be any character.
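
As a quick check of the full stop behaviour, here is a minimal Python sketch using the standard re module. The list of names is sample data chosen for the example. Note that the full stop matches the underscore in DSC_ just as happily as the letters a and b.

```python
import re

# Sample names: two renamed files and one that still has its original name.
names = ["DSCa_0004", "DSCb_0002", "DSC_0011"]

# "DSC." means: the letters DSC followed by any one character.
matches = [re.match(r"DSC.", name).group(0) for name in names]
print(matches)  # ['DSCa', 'DSCb', 'DSC_']
```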

Find+Replace is a useful tool. However, there are certain tasks that go beyond the capabilities of simple Find+Replace. Let's assume that we have the list of photos from above.

DSCa_0004
DSCa_0007
DSCa_0015
DSCa_0997
DSCb_0002
DSCb_0004
DSCb_0011
DSCb_0015

I don't want to leave this list as it is (one filename on each line). I want to place the files all on the same line, separated by a comma. This would be the regexp:

Find: newline
Replace with: ,

Note: You do not actually type in the word newline. Every time the Return or Enter key is pressed, a newline character is placed on the page (even if you cannot see it on the screen). Given that our filenames above occur on separate lines, there is a newline character after the final digit on each line. To further complicate matters, there are different types of newline characters (for example, Unix files end each line with a line feed, while Windows files use a carriage return followed by a line feed). Most text editors have a hard time dealing with newline characters (it is precisely for this reason that I have switched from Notepad++ to XEmacs). XEmacs is quite excellent at handling them with minimum fuss. To enter a newline character into your Regular Expression, type Control+Q Control+J, written in XEmacs notation as C-q C-j.

Executing this regexp results in:

DSCa_0004,DSCa_0007,DSCa_0015,DSCa_0997,DSCb_0002,DSCb_0004,DSCb_0011,DSCb_0015
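
Outside XEmacs, the same replacement can be sketched in Python, where the newline is written as \n instead of being typed with C-q C-j. The sample text below is assumed to have no newline after the last filename.

```python
import re

# The list from above, one filename per line, no trailing newline.
text = (
    "DSCa_0004\nDSCa_0007\nDSCa_0015\nDSCa_0997\n"
    "DSCb_0002\nDSCb_0004\nDSCb_0011\nDSCb_0015"
)

# Find: a newline character. Replace with: a comma.
joined = re.sub(r"\n", ",", text)
print(joined)
```

If the text did end with a newline, the result would end with a stray comma, so it is worth checking the end of the file before running the replacement.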

Example 3. A more complex example: Rearranging groups of text onto the same line
Let's assume that my wife went through the list of photos and wrote instructions to the photographer under each name, making the list look like this:

DSCa_0004
Print this one 6*4
DSCa_0007
Print this one 5*7
DSCa_0015
Print this on canvas
DSCa_0997
Print this 6*4
DSCb_0002
Print this one 8*10
DSCb_0004
Can you print this one in matte
DSCb_0011
Make 3 copies of this one
DSCb_0015
Print this one 6*4

I think that the best way to make this clear for the photographer would be to rearrange the text so that it looks like this:

DSCa_0004 = Print this one 6*4
DSCa_0007 = Print this one 5*7
(and so on....)

In order to make the correct regular expression, we must first identify the correct pattern in the text. We cannot simply replace all the newlines as we did in Example 2, because then every photo and comment would be on one very long line. We must place some unique characteristic into our regular expression that distinguishes the lines that belong together from those that do not. In this case, one possibility is to take advantage of the fact that all photo filenames begin with D, the comments for each photo are on the line immediately below, and no comments begin with D. The regexp would look like this:

Find: \(D.*\) C-q C-j
Replace with: \1 =

Let's examine this regexp. D searches for the letter D. The full stop "." matches any single character. The asterisk after the full stop means repetition (zero or more of the preceding item), so D.* matches a D followed by all of the remaining characters on that line. The backslash and parenthesis pairs around the D.* allow us to save the matched text and manipulate it in the replace term. The line break, represented by the Ctrl+Q Ctrl+J keyboard command, is outside the backslashes and parentheses, so it is matched but not saved, and it is discarded.

The replace term takes the D.* (effectively the filename, which is enclosed in backslash+parentheses in the search term), adds a space, then an equals sign, then a space. This produces the following output:

DSCa_0004 = Print this one 6*4
DSCa_0007 = Print this one 5*7
DSCa_0015 = Print this on canvas
DSCa_0997 = Print this 6*4
DSCb_0002 = Print this one 8*10
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
DSCb_0015 = Print this one 6*4
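
Here is a rough Python equivalent of this search and replace, run on a shortened sample of the list. One difference to keep in mind: Python's re module writes groups with plain parentheses, ( and ), where XEmacs uses \( and \). The \1 in the replace term works the same way in both.

```python
import re

# A shortened sample: each filename is followed by its comment line.
text = "\n".join([
    "DSCa_0004", "Print this one 6*4",
    "DSCa_0007", "Print this one 5*7",
    "DSCb_0011", "Make 3 copies of this one",
])

# (D.*) saves the filename; the newline after it is matched but not saved.
# The replace term writes the saved filename back, followed by " = ".
result = re.sub(r"(D.*)\n", r"\1 = ", text)
print(result)
```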

Example 4. An even trickier example: Rearranging groups of text onto the same line 2
Let's assume that my wife wrote instructions to the photographer under each name, but this time some comments begin with a D. The list looks like this:

DSCa_0004
Do you think you could print this one 6*4
DSCa_0007
Print this one 5*7
DSCa_0015
Do this on canvas
DSCa_0997
Dave, print this 6*4
DSCb_0002
Print this one 8*10
DSCb_0004
Can you print this one in matte
DSCb_0011
Make 3 copies of this one
DSCb_0015
Print this one 6*4

This time, we cannot use the previous regexp as it does not identify only the photo filenames. It will also select the comments beginning with D. Using the regexp from Example 3 in this case will result in this:

DSCa_0004 = Do you think you could print this one 6*4 = DSCa_0007 = Print this one 5*7
DSCa_0015 = Do this on canvas = DSCa_0997 = Dave, print this 6*4 = DSCb_0002 = Print this one 8*10
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
DSCb_0015 = Print this one 6*4

It is a mess. We need to find a pattern in the text that uniquely identifies the filenames only and not the comments. Look at the filenames. All begin with DSC. So, we could change the regexp to find DSC or even DS. No comments begin with DS. Let's give it a shot.

Find: \(DS.*\) C-q C-j
Replace with: \1 =

And here's the output:

DSCa_0004 = Do you think you could print this one 6*4
DSCa_0007 = Print this one 5*7
DSCa_0015 = Do this on canvas
DSCa_0997 = Dave, print this 6*4
DSCb_0002 = Print this one 8*10
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
DSCb_0015 = Print this one 6*4

Easy. We could have used \(DSC.*\) as the search term and the result would have been the same.
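
To see the difference between the two search terms side by side, here is a small Python sketch on a shortened sample. As before, Python writes groups as ( and ) instead of \( and \), and the variable names are chosen only for this example.

```python
import re

# Shortened sample in which some comments begin with D.
text = "\n".join([
    "DSCa_0004", "Do you think you could print this one 6*4",
    "DSCa_0007", "Print this one 5*7",
    "DSCa_0015", "Do this on canvas",
])

# Too loose: D.* also matches the comment lines that start with D.
loose = re.sub(r"(D.*)\n", r"\1 = ", text)

# Tighter: only the filenames start with DS.
tight = re.sub(r"(DS.*)\n", r"\1 = ", text)

print(loose)
print("----")
print(tight)
```

The loose version glues the first comment onto the next filename, while the tight version keeps one photo per line.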

Example 5. Getting rid of unwanted information.
Let's assume that the photographer receives our list of photos. He wants to see which photos we want to print in what sizes, but he also wants to leave the descriptions for files where no size has been specified. Using regexp, we can remove all info other than the sizes. Here's the file:

DSCa_0004 = Do you think you could print this one 6*4
DSCa_0007 = Print this one 5*7
DSCa_0015 = Do this on canvas
DSCa_0997 = Dave, print this 6*4
DSCb_0002 = Print this one 8*10
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
DSCb_0015 = Print this one 6*4

The unique thing that identifies the comments that contain sizes is the asterisk character in the dimensions. We want to keep the asterisk as well as the characters immediately before and after the asterisk (the numbers). And here is the regexp:

Find: \(= \).*\(.\*.*\)
Replace with: \1\2

This is the output:

DSCa_0004 = 6*4
DSCa_0007 = 5*7
DSCa_0015 = Do this on canvas
DSCa_0997 = 6*4
DSCb_0002 = 8*10
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
DSCb_0015 = 6*4

As you can see, only lines containing sizes have had their comments removed. Let's go through the search term. The \(= \) searches for the equals sign followed by a space (which denotes where comments begin) and stores it in \1. The .* that follows matches the rest of the comment text, which is not kept, up to the final pattern \(.\*.*\), which is stored in \2. Let's unpack this final part of the search. The first full stop matches any single character, the \* matches a literal asterisk (note: the backslash before the asterisk is essential, as it specifies that we are searching for an actual asterisk * character, and not using the asterisk as the repetition operator, as in .*), and the last full stop and asterisk allow any characters after the asterisk.
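
The same idea can be tried in Python on a shortened sample. This is a sketch, not a copy of the XEmacs command: Python writes the groups as ( and ), but the \* for a literal asterisk is the same. Because the pattern keeps only one character before the asterisk, a size such as 10*8 would come out as 0*8, so it suits the sizes in this list but not every possible size.

```python
import re

# Shortened sample: two lines with sizes, one with a comment only.
text = "\n".join([
    "DSCa_0004 = Do you think you could print this one 6*4",
    "DSCa_0015 = Do this on canvas",
    "DSCb_0002 = Print this one 8*10",
])

# Group 1 keeps "= ", the middle .* is thrown away,
# and group 2 keeps one character, the asterisk, and whatever follows.
sizes_only = re.sub(r"(= ).*(.\*.*)", r"\1\2", text)
print(sizes_only)
```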

Example 6: Rearranging information.
Our photographer has decided that he would like to display the sizes of our photos on the left and the filenames on the right of the equals sign. Photos containing comments only should remain as they are. Here is the text:

DSCa_0004 = 6*4
DSCa_0007 = 5*7
DSCa_0015 = Do this on canvas
DSCa_0997 = 6*4
DSCb_0002 = 8*10
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
DSCb_0015 = 6*4

Again the asterisk is our unique identifier of the sizes. Here is the regexp:

Find: \(.*\)\(= \)\(.\*.*\)
Replace with: \3 \2\1

And here is the output:

6*4 = DSCa_0004
5*7 = DSCa_0007
DSCa_0015 = Do this on canvas
6*4 = DSCa_0997
8*10 = DSCb_0002
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
6*4 = DSCb_0015
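
A Python sketch of the same swap is below, again with ( and ) in place of \( and \). One detail shows up when you run it: the first group also captures the space before the equals sign, so each rearranged line ends with a trailing space. The sketch strips it afterwards; that clean-up step is an addition for tidiness, not part of the regexp above.

```python
import re

# Shortened sample: two lines with sizes, one with a comment only.
text = "\n".join([
    "DSCa_0004 = 6*4",
    "DSCa_0015 = Do this on canvas",
    "DSCb_0002 = 8*10",
])

# Group 1: filename (plus a space), group 2: "= ", group 3: the size.
swapped = re.sub(r"(.*)(= )(.\*.*)", r"\3 \2\1", text)

# Remove the trailing space left over from group 1.
tidy = [line.rstrip() for line in swapped.splitlines()]
print("\n".join(tidy))
```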

Our photographer could have done this at the beginning of Example 5 (if he wanted to). Here is the text from Example 5:

DSCa_0004 = Do you think you could print this one 6*4
DSCa_0007 = Print this one 5*7
DSCa_0015 = Do this on canvas
DSCa_0997 = Dave, print this 6*4
DSCb_0002 = Print this one 8*10
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
DSCb_0015 = Print this one 6*4

And here is the regular expression that will (a) get rid of all info apart from sizes, and (b) rearrange the order of the text to size = filename:

Find: \(.*\)\(= \).*\(.\*.*\)
Replace with: \3 \2\1

And here is the output:

6*4 = DSCa_0004
5*7 = DSCa_0007
DSCa_0015 = Do this on canvas
6*4 = DSCa_0997
8*10 = DSCb_0002
DSCb_0004 = Can you print this one in matte
DSCb_0011 = Make 3 copies of this one
6*4 = DSCb_0015

As you can see, the two outputs are identical.
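
If you would like to check that claim yourself, the following Python sketch runs the two-step route (Example 5, then Example 6) and the one-step route on a shortened sample and compares the results. The patterns are the Python spellings of the regexps above, with ( and ) in place of \( and \), and the variable names are made up for the example.

```python
import re

# Shortened sample of the text from Example 5.
text = "\n".join([
    "DSCa_0004 = Do you think you could print this one 6*4",
    "DSCa_0015 = Do this on canvas",
    "DSCb_0002 = Print this one 8*10",
])

# Route 1: strip the comments first, then swap size and filename.
step1 = re.sub(r"(= ).*(.\*.*)", r"\1\2", text)
two_step = re.sub(r"(.*)(= )(.\*.*)", r"\3 \2\1", step1)

# Route 2: do both jobs with a single search and replace.
one_step = re.sub(r"(.*)(= ).*(.\*.*)", r"\3 \2\1", text)

print(two_step == one_step)  # True
```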

These are only a few examples of how regular expressions can be used. I hope that these examples will be useful for you and will provide some guidance and stimulation about what is possible with regexp. If you have any questions, please post them in the comments below.