A simple PowerShell script to find and replace using regular expressions in multiple files
One thing I need from time to time is the ability to
perform a find-and-replace operation across multiple
files, using regular expressions. When this happens, I
usually try
Visual Studio's own support
for this kind of task; soon, however, I have to give it
up and blame my favorite IDE for not adhering to the
regular expression syntax adopted by the .NET
Framework, which I'm used to.
So, today, after my
umpteenth unsuccessful attempt with Visual Studio, I
resolved to implement a simple PowerShell script stub, which
would act as a starting point for doing this job for me
from now on. No, this is by no means a complete grep-like
tool; I would like it to be just a demonstration of how easy,
powerful and "clean" PowerShell scripts like this one can be.
And yes, I know there are plenty of third-party tools that
do this kind of thing...
To go into the
specifics of my problem, I was trying to combine a set of
HTML files, which I grabbed after a CHM to HTM conversion,
into a single one. Since the images inside these documents
are just thumbnails wrapped in a hyperlink that lets the
user click to see the image at its original size,
I wanted to perform some regular expression substitutions in
order to have the original-size image embedded directly into
the document, have the thumbnails removed, and have the
header and footer of each individual HTML file removed
before it is combined into the target one.
Since PowerShell is a
.NET managed shell, we can naturally use our beloved
Regex class
to perform our regular expression substitutions, thus
keeping the syntax we are accustomed to.
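As a small illustration of that point, here is a minimal sketch of calling the .NET Regex class from PowerShell with a named group in the replacement string. The sample HTML fragment and the simplified pattern are made up for this example and are not taken from my actual files.

```powershell
# Hypothetical input: a thumbnail wrapped in a link to the full-size image.
$sample = '<a href="images/big01.png"><img src="images/thumb01.png" /></a>'

# [regex] is PowerShell's type accelerator for System.Text.RegularExpressions.Regex.
# The named group "Url" captures the link target.
$rx = [regex]'<a href="(?<Url>[^"]*)"[^>]*>.*?</a>'

# ${Url} in the replacement refers to the named group. Single quotes keep
# PowerShell from trying to expand it as a variable.
$result = $rx.Replace($sample, '<img src="${Url}" />')
$result
```

Inside a double-quoted PowerShell string the same substitution has to be written with a backtick in front of the dollar sign, so that it reaches the Regex class untouched.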
Here's the code which today solved my problem; feel free to
use it as the starting point for similar issues:
# Matches a thumbnail link inside <span class="figure"> and captures the full-size image URL
$rxFigure = New-Object System.Text.RegularExpressions.Regex "(?:\<span.class=""figure""\>)\<a.href=""(?<Url>.*?)"".*?\>\</a>"
# Singleline lets "." match newlines, so the header and footer can span several lines
$rxHeader = New-Object System.Text.RegularExpressions.Regex "<html>.*?</table>", Singleline
$rxFooter = New-Object System.Text.RegularExpressions.Regex "<table.*?</html>", Singleline

# The combined document is written one level above the current directory
$combinedOutput = [System.IO.File]::CreateText([System.IO.Path]::Combine((Get-Location).Path, "..\Combined.htm"))

Get-Item "*.html" | ForEach-Object {
    $text = [System.IO.File]::ReadAllText($_.FullName)
    # Replace each thumbnail link with the full-size image
    $text = $rxFigure.Replace($text, "<img src=""`${Url}"" />")
    # Strip the header, then turn the footer into a page break
    $text = $rxHeader.Replace($text, "")
    $text = $rxFooter.Replace($text, "<br style=""page-break-after: always;"" />")
    $combinedOutput.Write($text)
}

$combinedOutput.Close()
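For the more general case of replacing text in place across several files, the same idea could be wrapped in a small function. This is only a sketch under a few assumptions: the name Update-FileText and its parameters are hypothetical, it needs PowerShell 3.0 or later (for the -File switch), and it rewrites each changed file as UTF-8, so it is worth trying on a copy of the files first.

```powershell
# Hypothetical helper: applies one regex replacement to every file matching $Path.
function Update-FileText {
    param(
        [Parameter(Mandatory = $true)] [string] $Path,      # wildcard path, e.g. ".\*.html"
        [Parameter(Mandatory = $true)] [string] $Pattern,   # .NET regular expression
        [Parameter(Mandatory = $true)] [AllowEmptyString()] [string] $Replacement,
        [System.Text.RegularExpressions.RegexOptions] $Options = 'None'
    )

    $rx = New-Object System.Text.RegularExpressions.Regex $Pattern, $Options

    Get-ChildItem -Path $Path -File | ForEach-Object {
        $text = [System.IO.File]::ReadAllText($_.FullName)
        $newText = $rx.Replace($text, $Replacement)

        # Only touch files whose content actually changed (case-sensitive comparison).
        if ($newText -cne $text) {
            # WriteAllText overwrites the file in place, encoded as UTF-8.
            [System.IO.File]::WriteAllText($_.FullName, $newText)
            $_.FullName   # emit the path of each modified file
        }
    }
}
```

With this in place, stripping the header from every HTML file in the current directory would look like `Update-FileText -Path ".\*.html" -Pattern "<html>.*?</table>" -Replacement "" -Options Singleline`.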