How to become a data crunching wizard using a combination of the Advanced Filter and Regular Expressions in LibreOffice Calc.
I originally wrote this article back in January 2016, but is seems that LibreOffice might have changed it’s regular expression engine, because a lot of the stuff that worked back then doesn’t seem to work now.
Therefore I’ve completely rewritten this article from scratch for 2021, so let’s get filtering.
Get the database, files and sql used in the tutorial video above.
Also get updates when new blog or video tutorials are released.
Setting up your Advanced Filter
I’ve taken 43 rows of blog article information from one of my LibreOffice base databases and pasted it into calc.
The basic rule for setting up your data is that it should look like the image below.
That’s to say data should be in a rectangle with a header row and have no other cells with data “touching it”, meaning if you look at this image, you will see that column "F" is the last column, therefore column "G" and row "45" should be completely blank because they’re adjacent to this data block.
In order to perform a filter extract (meaning that in another area of your choosing in the sheet, you will have a subset of your data extracted there that meets your critera) you will have to setup a "Criteria" area.
To do this copy or type the header name of the column that you would like to query on, in our example I’m using the "PageTitle" header and entering the criteria into the cell below.
Given that my data in this example ends with Column "F" I would most likely type my criteria header name into cell "H1" in order to leave a blank column between it and the data (because you don’t want other cell values touching your data range).
Finding rows that contain some text
Now let’s look at the criteria that I just typed into cell "H2", the purpose of which is to find rows of data in column B that contain "vel" anywhere in the cell.
So that you understand where I’m coming from the word "developer" and "level" both contain "vel".
Now the contents of the criteria cell are .*vel.*.
Reading this regular expression from left to right…
Find me any one character (that’s what this dot or period does) ..
Zero or more times, that’s the asterisk *, which means match the preceding character zero to many times, the combination of this dot and asterisk .* effectively creates a "match any text" query, in SQL and Excel the astrix on it’s own has this function.
Next within this string defined above (i.e. match anything) we need to find the text "vel".
The .* after vel means that vel can be either the end of the line of text, or can be inside a larger piece of text (i.e. zero or many times).
Now let’s execute this query in LibreOffice Calc.
First click on the data range (you can see in the image below I’ve clicked on cell "B8") then choose "Data", then "More Filters" and "Advanced Filter" from the menu.
The "Advanced Filter" dialog box will then appear, first click in the box to the right of "undefined" then click on the range ("paint") that you want to use for the criteria, in this instance I’ve used cells "H1:H2".
Next click on the "+ Options" in order to expand the dialog box.
Now you need to click on the "Regular expression" check box and the "Copy results to" checkbox.
Then click in the text box next to "undefined" (below "Copy results to") and paint the range that you want your sub report extracted to, in this instance I chose cell "J1".
Now click "OK" and your sub report (results) will appear in that area.
Note that you also have the option to tick "case-sensitive" if you so wish, which you can override for specific words using brackets [] for even more power, so for example all your searchs could be case sensitive if you want, with the exception of "data specialist" which could appear as [Dd]ata [Ss]pecialist.
The image below demonstrates the rows in the data that contained "vel" in the page title.
Now that I’ve demonstrated how to perform the extract, in the next examples I’m just going to focus on the criteria differences for different reports.
The YouTube video that goes with this post (Which you can watch on YouTube if you prefer) will demo this "live" well live-ish ;-).
The other thing to bear in mind is that if you intend to do multiple extracts for different queries to the same cell "J1" in this example, then it’s best to wipe the previous result data (in columns "J" through "O" in this example) prior to your next extract, otherwise things can get confusing, but if you’re extracting to different areas for each sub report, then that’s not a problem.
Finding rows that start with some text
In order to better explain the following point, I’ve added in a fake row of data to row 45, see below.
So let’s say that I want to find rows that begin with "Excel"
The symbol for text STARTS WITH something is ^ so in this case the criteria would be ^Excel or ^excel both work, as the default setting with the extract is to have case sensitivity unticked, if for some strange reason you have case sensitivity ticked but you want both of the previous capitalisations to work, then the query is ^[Ee]xcel the brackets obviously meaning choose one of these characters.
So after executing this query notice that we only get one row returned, i.e. the fake row that I added to the end of the list range, event though records with the following ID begin with Excel 1, 8, 30, 37 and 99.
This is because only record 99 is an exact match, because the query says "begins with Excel and has nothing else", so in fact given that this was an exact match query (because it’s missing a wildardcard after "Excel" i.e. .*) just typing Excel into the criteria cell in "J10" would have given us the same result.
Now let’s do it proper, like…..
Here we’ve used the query "^[Ee]xcel.*", but "^Excel.*" or "^excel.*" would have worked equally well as we haven’t ticked "Case-sensitive" in the advanced filter dialog box.
Finding rows that end with some text
Suppose now for some crazy reason I want to find blog posts that END WITH " base" in their title (note the blank space before base).
Let’s do that now…
So again, given that the Regular expression engine reads left to right, we must first say give me anything .*, then find " base" (again pay attention to the blank space before base, this ensures we don’t select database when we want "LibreOffice base") and it must be the final element in the line $.
In LibreOffice calc I’ve gotta make the point though, it seems that the dollar at the end is redundant, (though not for regular expressions in general) because if you don’t put any wildcards at the end of your criteria expression then that must mean that it is the end.
The image above bears out this argument, I’ve done two seperate extracts, the one on top just has wildcard, space and base .* base and the extract below it has $ and the results are exactly the same.
Note in the examples above instead of a space before base you could instead use \b which indicates a word boundary, therefore your criteria would look like .*\bbase note the \b in there (which let’s face it are more elegant, if you can remember to use them!).
Widen your search
The regular expresssion engine behind the LibreOffice calc advanced filter makes it at least as powerful as the Excel Advanced Filter because you have the options of finding things "the old way" or the regex way.
To demonstrate this let’s find rows from the list that contain "wordpress" or "ribbon" or "import".
So this advanced filter was done the same way as all the others with the exception that the criteria range was now "K1:K4", this translates as give me any rows that contain "wordpress" or "ribbon" or "import".
You can see below that the result was those three rows.
Now let’s do the same query using pure regex (the previous example was a mixture of regex and classic advanced filter).
I gotta say I kinda like this method more, as it’s concise and most likely easier to incorporate into a macro via string manipulation rather than the multi row approach.
So for clarity, let’s decode this one .*(wordpress|ribbon|import).* so as usual we have our wildcards either side of our main text search .* we then create a group with the "or" aka "pipe" symbol (vertical bar to the left of Z on UK keyboard) to delimit each of the items that we want to search for.
Narrow down your search
Here we want to tighten up our criteria, until we find exactly what we’re looking for.
We’ve got to use the hybrid method here, there’s no easy regular expression way of doing boolean "AND" queries so we’ll use the advanced filter method combined with regex.
So let’s start by entering .*LibreOffice.* as our initial query to find rows that contain the page title "LibreOffice" this gives us a resultset of 7 rows.
So all rows above contain "LibreOffice" however I need to see the records that relate to LibreOffice forms as opposed to Excel forms, therefore the boolean logic is "LibreOffice" AND "form", to achieve this I place the AND column header next to the existing header and add the extra criteria item there (note that the columns don’t have to be the same, I could have said "LibreOffice" and "2016" which would have meant that I put "Year" next to "PageTitle")
So "LibreOffice" and "form" has reduced the 7 records down to 3 that contain both terms.
This can be further reduced as in the next image by adding another header to the criteria for "Images" which gives us just one row that contains all three words in the title.
Note that for each extract the criteria has to be selected again covering all elements.
Find blank cells
This one seems to be simple enough, use the "line starts with" character next to the "line ends with" character with nothing between them ^$ will give you the rows that contain a blank cell for that column.
Find Non Blank Cells
I’ve tested this one out and it seems to work, the asterix means give me zero or more characters wheras the plus means give me one or more of the character, so in your criteria column enter .+ and your extract will have rows that don’t contain any blanks for the column that you’ve run this criteria on.
Carlos says
Very GOOD. Thanks for the help (in name of everyone)
Carlos says
By the way… WHAT IF… we want to extract a SUBSTRING (where they might be there) anywhere in the cell range ?
I mean… it doesn’t BEGIN WITH or END WITH only just is “there… somewhere).
Wild cards other than these ?
Thank you.
Greetings
Carlos says
Sorry… because I meant NOT to extract a SUBSTRING… in fact, to get a report of a range WITH only the data WHERE that SUBSTRING is (anywhere in the MIDDLE of every CELL, if that SUBSTRING really exists).
Could we associate the VLOOKUP function with another to get there ?
Thanks again.
River Judd says
Again, very helpful
Sergei says
Thank you for clarifications, but one more question arise.
The regexp search works fine, but it is quite slow on large tables.
For example I have a list of items with about 50K lines. I need to filter only lines contains (in same cell) item containing either “zalman” and “cooler” either “zalman” and “fan”.
As I said, the regexp filter you recomnmended works fine, but it is very slow. After the filtering I can choose the “conditional filer” – “standard filter”, and I can see “column_name IS EQUAL TO (“=”) regexp”
But in this settings I can manually choose “contains” and use the desired substrings without regexp. Such filter works much faster.
And the question is: how to designate “contains” condition in the “filter conditions” cells?
For example I can use the math less-than and more-than signs like “>1000” AND “<1500". But what about the substring search?
Any ideas aside manual configuration (or manual ajustments) of the conditional-standard filter?
Thank you.
Sean Johnson says
I’m afraid I don’t have any suggestions on that, I guess the regex has a lot of work to do when you’re hitting 50K lines in your spreadsheet, chances are you’re getting in to the area of perhaps needing to use the base database to prefilter your data. Regex is one of many tools to solve a problem, when you say it’s very slow, spreadsheets use a lot of ram and if you’ve got 50K records maybe the sheet is causeing your pc to cache to disk depending on your config.
But I’m afraid it is what it is, I’m not a RegEx guru, so I guess the answer is test and experiment.