Monthly Archives: May 2019

Fun with Python- Pivot table to csv

I recently had the need to enumerate all of the filespace being used on a server. I needed it broken down by folders, and within those folders, the space consumed by popular file formats, mainly, image formats and PDFs. The server OS is Windows Server and I had to use Windows 10 as the platform to do it from.

I have played a bit with Powershell and appreciate how sophisticated programming techniques are available right there in the command line. I googled a bit and found some bits and pieces to guide to my goal, which was basically to iterate over all the folders and spit out filenames and filesizes if the filename matched a variety of necessary patterns (.jpg, .gif, etc.). This is what I came up with:

 get-childitem -path w:\wwwroot -Include *.jpg,*.png,*.pdf,*.bmp,*.gif -Recurse | where {!$_.PSIsCo
ntainer} | select-object FullName, LastWriteTime, Length, Extension | export-csv -notypeinformation -path c:\local\allfi
les.csv | % {$_.Replace('"','')}

In a nutshell, this command does this:

  • get-childitem- used to look at a folder’s subfolders
  • path- searches over a certain path
  • Include- looks for that list of filetypes
  • Recurse- look through all subfolders
  • PSIsContainer- looks for items that match filters
  • select-object- used to select the various properties of objects. Here I’ve included Fullname (something like “w:\wwwroot\folder\yadda.jpg”), LastWriteTime (a date/time stamp), Length (the size of the file in bytes) and Extension (.jpg, .pdf, etc.) That last one pretty much makes this whole thing possible, as now I can sort and report totals by filetype.
  • export-csv- yep! I am exporting the results to a csv so I can monkey with it in Excel.

After running the command, it yields a file which looks like this:


This can be put right in Excel and sorted, culled, summarized, etc. However… this document has 221,000 rows. It would take a long, long time to wrangle that by hand. And sure, some of it Excel can do out of the box, and for other things, there are macros… I might have been able to piece together various tools and code and get it done.

That’s when I thought of Python. I had already used Python to transmogrify text to do my bidding. Surely there was a way I could auto-magically achieve what I needed in a couple of steps or so. And indeed after some digging, I found some fun things using pandas (as describes it, “pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language”). I already have Python 3 and pandas on my machine to do some things with analytics.

From my days as a trainer what seemed to be needed here was a pivot table, and sure enough, there is a pivot table dataframe for pandas that will do the job. This is the script I settled on eventually- it took all of four lines.

import pandas as pd
df = pd.read_csv (r'C:\local\allfiles.csv')
print (df.pivot_table(index='Site',columns='Extension',aggfunc=sum))

Important parts:

  • df means dataframe. In the second line (after including pandas in our recipe), a dataframe for reading the generated csv file is invoked.
  • I started with the print line just to get visual confirmation it was working. I just kept it to serve the same purpose in the final version.
  • then the pivot table dataframe is created. the “Site” column is the index, and the contents of the “Extension” column are aggregated and summed for the output table- the filename of which is then specified.

Pandas is a powerful library. But what I find interesting is how it distills what is a bit of brainteaser of a concept into some clear directives, thus yielding a bit of insight into how a pivot table is structured. I guess that’s obvious, still I find it- informative? Fun? something.

But we’re not quite done here. it does output a lovely product:

pivot table printing to screen

What the heck is “4.094785e+08?” That is scientific notation. Excel will do with exceptionally large numbers if no specific format is selected. Since what’s happening here is a decrease of significant digits for the sake of brevity. I think this is a standard way of depicting these numbers when the format- in this case, the rows and columns of a pivot table- matter.

NaN means Not a Number- in this case the value is zero- nothing there.

The scientific notation bit threw me for a loop at first. I googled my fingers off looking for a way to convert the numbers to a specific format before exporting. There are some ways to do it, but I had a lightbulb moment, realizing unlike all the sorting and summing, changing number formats for a range of values is no problem at all in Excel. So I went ahead and imported it. I changed the format from General to number, and things started looking a lot more normal.

But it still didn’t strike me as being as “at-a-glance” as my needs dictated. On a Stack Overflow discussion I found several helpful tips about converting number formats to common filesize units- kilobytes, megabytes, gigabytes, etc. I decided MB was good enough for my purposes, though there are some elaborate solutions that take into account everything from bytes to petabytes- I highly recommend it. This example was what I needed:

[<1000]##0.00"  B";[<1000000]##0.00," KB";##0.00,," MB"

This will help all the numbers in this format to express themselves in terms of MBs.

After removing some extraneous bits- the filename and subfolders past the main site folders really weren’t of any use in this report; consolidating the “JPG” figure with the “Jpg” figures and “jpg” figures; click-dragging a few autosums; then setting the format of the numbers, I produced a pretty nice report of the filespace in several server folders taken up either by images or PDFs.

Not bad!