In this tutorial you’ll see how to import a text file that’s got 1.5 million rows of data, that’s roughly 500K more rows than Excel can handle.
Get the spreadsheet and data file.
Also get updates when new blog or video tutorials are released.
As of Excel 2016 going forward Excel can hold 1,048,576 rows of data, but sometimes even that’s not enough.
Here we have a large csv file with roughly 1.5 million rows of data, this definitely blows Excel’s data limit.
We’re going to use power query to look at this file so after downloading the data file…
Click on the Data tab, then choose "Get Data" then "From File" then "From Text/CSV"
Go to the folder containing your txt file and click "Import".
You’ll most likely see a "connecting" dialog box appear and disappear, then you’ll see a view of the data like so…
Click on the "Transform Data" button to launch thePower Query Editor.
From Excel’s perspective it’s not a misnomer, it truly is a "power query" editor because it can handle data sets way in excess of Excel’s million and a half row limit.
There are a number of ways to get your data into Excel from here such as…
- Use Power Query to get all of the data to Excel in chunks.
- Throw away all the data you don’t want prior to import.
- Create an unfiltered query, then using the "Advanced Editor" edit the M code and use it in an Excel VBA macro to manipulate and dynamically pull in data from the file.
Get All The Data In Chunks
This zoomed in view of "Power Query Editor", demonstrates that "Use First Row as Headers" was selected by default, if that’s not the case you may need to change that setting to "Use Headers As First Row" in which case the columns will have the generic names Column1, Column2 etc.
We’ll use "First Row as Headers" as that’s how the data’s formatted.
As my Excel’s region settings are UK and I use UK date format (Day, Month, Year) you’ll notice that the date format below is United States (Month,Day,Year), let’s do an operation on the "Order Date" column to transform it to (Day, Month, Year).
Select the "Order Date" column in this instance and choose "Split Column", then "By Delimiter" from the ribbon menu.
In this instance we’ll split the date column into three seperate columns for month, day and year so that they can be reassembled later and Excel does’nt get confused with different data formats.
Therefore we’ll choose "Custom" for the delimiter type.
The "Advanced Options" should indicate split into 3 columns for this particular project.
So we’ve split the columns based on the "/" delimiter, notice that every change to the process results in a new step being added to the applied steps box (lower right in image below).
You can view earlier stages of your data transformation process by simpley clicking on any step and the main query window adjusts to display the process as it is at that stage..
Beware of removing steps in the middle of the process as each subsequent step may have relied on the previous step thus causing the query to break.
Your now have new columns for Month, Day and Year Order Date.1, Order Date.2 and Order Date.3).
Let’s give these new columns meaningful names, such as OrderMonth, OrderDay and OrderYear.
First click the column header that you want to rename, then in the "Transform" tab and the "Any Column" Section click "Rename", then type the new column name.
Now it’s time top get the data into Excel, and since the data’s row count exceeds Excel’s, we’ll need to split file into different tabs.
In this example we’ll use seperate tabs for each region, if you click the down arrow on region we can see how many regions exist in the data, in our case we only have 7, so that’ll mean 7 tabs in Excel (If you see a "Load more" option when you press the down arrow click it to ensure you’ve got all the data).
Now you can take each region element in turn, so lets start with "Asia", click "(Select All)" to toggle all selected items off, then click "Asia" and then "OK", see above.
Now click the top left of your grid (see below) and from the drop down menu choose "Copy Entire Table".
Warning, I’ve just discovered a major weakness here, that’s that "Copy Entire Table" doesn’t do what it says on the box, it will just copy what’s loaded in your query editor, for this example here, on the first copy I got 1,000 rows for Asia to Excel, then I paged down further in the Power Query Editor as far as 7,655 records and did "Copy Entire Table" again and it copied all 7,655, but the point is there are probably 20K records for Asia, so for "Copy Entire Table" to work you would have to page down until you see all the data in the "Power Query Editor".
But this is still a very useful "quick and dirty" method of getting data to Excel where appropriate.
Then go to Excel and paste this data to a worksheet and name the tab "Asia" for example.
Rinse and repeat for the other elements in the region summary, e.g. "Middle East and North Africa", "Australia and Oceania" etc.
Doing it this way means you got all of the data into Excel.
That’s the promise of this article done, however there are other ways of doing this also.
Only Import What You Want
This means we pre-filter the data in "Power Query" prior to import.
So rather than pulling all the data into Excel, we build our summarised reports in Power Query.
Let’s go for summarised costs, revenue and profits, by Region, Country and ItemType, By Order Year.
This will probably result in a large dataset, but much smaller than 1.5 million rows that exists in the text file and still very useful within Excel.
- Split Order Date To get Year.
- Rename "Order Date.3" to "Order Year".
- Remove columns not used in the final report, i.e. "Sales Channel", "Order Priority", "Order Date.1", "Order Date.2", "Order Date", "OrderID", "Ship Date", "Units Sold", "Unit Price", "Unit Cost".
Move newly created "Order Year" to left most position for sorting purposes.
That should leaves us with the following columns "Order Year", "Region", "Country", "Item Type", "Costs", "Revenue", "Profits".
To do this Shift Click on each Column Header to select and then Right Click and choose "Remove Columns" while cursor is hovering over a column header.
Now lets create a summary query.
We’ll group by "Order Year ", "Region ", "Country " and "Item Type ", we’ll also sum the "Total Revenue ", "Total Cost " and "Total Profit " columns and rename them to "Revenue ", "Cost " and "Profit ".
Click OK, then power query will process for a few seconds prior to report creation.
-
You can then Sort The Report, sort order seems to be simpler (i.e. less powerful) than Excel’s native sort functionality, meaning data is sorted left to right in the order it’s displayed.
So click the column header that you want to sort then, choose "Ascending" or "Descending", we’ll do this for "Order Year", "Region", "Country" and "Item Type".
Note that each column is sorted individually, meaning you click sort on a column, then wait until the sort is performed, then move onto the next column and sort that and so on.
The Order that you click the columns in governs primary, secondary sort order etc, note the number in the red circles below, you’ll see that "Order Year", "Region", "Country" and "Item Type" are Top Level, Second, Third and Fourth level sort orders.
Now send the data to Excel, you can choose "Close & Load", or "Close & Load To…" I chose the latter, which gave me the following dialog box to refine my choice, in this instance I went with the default "New worksheet".
-
The end result is 17,761 rows of data in Excel, considerably less than 1.5 million but still highly usable.
James Robson says
Hi Sean,
I’m bringing data into Power Query via an ODBC from our ERP system and have a table containing 2 millions rows. I then apply multiple filters, merging and formulas to create approx. 1,000 lines I bring into excel removing the need for any excel formulas (Great!).
However the 2 millions lines are after only 6 months and already take considerable time to load and my concern is in several years time this will become unusable.
Are there any steps that I could take to speed this process up (generally I apply filters first to reduce the number of lines straight off the bat)?
Sean Johnson says
It sounds like you’re trying to “download the database” which isn’t a good idea, the database should stay on the server and you need to just get specific answers to your questions.
Taking down 2 million rows in six months has gotta mean your doing something wrong, you either need to design various summary queries to answer specific questions, or build an access database to store this information or see if you can have a chat with the database administrator for this system.
Tell him or her what you’re trying to achieve and they may give you some pointers.
Fabricio Miriani says
Hi Sean!
Question – when having to get information from two files with +5Millon rows: would you perfom the JOIN in the SQL Access Database and then bring it to the Power Query?
I’m struggling to play with some data I need to use as every step I do takes lots of time.
Sean Johnson says
To be quite honest Fabricio
With a file that large, I’d do as much of the processing as possible in Access or Sql Server, i.e do all sql join and filter operations in the database and only bring the answer into power query.
Databases in my opinion are much better for database operations, power query is for business people to do stuff with data, data that often resides in enterprise databases and have been pre filtered
via existing sql views or stored procedures, so if you’ve got 5 to 10 million rows do as much as possible in Access and compact the database often as it will bloat with every operation.
So to be clear I do all of my heavy lifting in databases rather than power query I then bring in the result from the database via code directly to Excel.
Michael Ger says
Thank you, Sean, so much! Being unable to load data into the database for sorting, I spent hours trying to figure out how to do that for the csv file larger than the workbook limit. Your tutorial made my day!
Sean Johnson says
Glad I could help Michael