C#

C#: Query Excel and .CSV Files Using LinqToExcel


Image by Elif AyiterSome Rights Reserved

In my quest to get a rather hack-ey work project DONE and off my machine, I discovered a couple exceptionally useful OSS tools for working with MS Excel data, and at the same time, I also found a handy library, DocX, for working with MS Word files which not only spares one the issues of Interop/COM, but in fact does not require Interop assemblies at all. In fact, it doesn’t even require that Word be installed on your machine at all. I take a look at DocX in my next post.

Mostly, I need to get data out of Excel, which admittedly, can be done using ADO.NET. However, LINQ to Excel makes this pretty disgustingly easy.

Use LinqToExcel to Query Spreadsheet Data Using (wait for it) . . . LINQ

LinqToExcel by Paul Yoder is a .Net library which facilitates querying Excel spreadsheets and .CSV files using the familiar LINQ syntax. The source is available at Yoder’s Github Repository, and there is also a Nuget package available. The source is available with a broadly permissive license which essentially allows you to do whatever you want with the code, so long as a copy of the license is included.

It is important to note that you need to use the 64 bit version of LinqToExcel if you are targeting the x64 processor architecture with your application. Also in this case, you will need the 64 bit version of the Access Database Engine installed.

UPDATE 1018/2013 – As it turns out, as cool as LinqToExcel is for most cases, for MY specific project there was an issue. That dependency on the ACE Database engine. My project was to be deployed to Windows Azure Websites, which does not have support for the ACE Engine. Back to the drawing board. I found another library, ExcelDataReader, which is not as slick as LinqToExcel for simply reading data, but which carries NO DEPENDENCIES on proprietary MS technology which would create issues deploying to a remote server.

Get the LinqToExcel Package Using Nuget

To get started, either go to the Solution Explorer, you can simple enter the following in the Nuget Package Manager Console:

Get LinqToExcel Using the Nuget Package Manager Console:
PM> Install-Package LinqToExcel

Or, you can right-click on “Manage Nuget Packages for Solution” and search for “LinqToExcel” in the online packages in the Package Manager window:

Get LinqToExcel Using Solutions Explorer GUI Menu Item:

install-linq-to-excel-nuget-pkg

The Simplest Way to Query Data from Excel

Wanna get some data from an Excel spreadsheet? It’s this easy:

Consider the following spreadsheet:

Example Spreadsheet Data:

ArtistAlbumsSpreadsheet-Excel

Here is the very simplest way we can query this data and (for the sake of this example) write it out to the console:

Straight and Simple Query Using Linq Syntax:
void PrintArtistAlbums()
{
    string pathToExcelFile = ""
        + @"D:\Code\Blog Projects\BlogSandbox\ArtistAlbums.xlsx";
  
    string sheetName = "Sheet1";
  
    var excelFile = new ExcelQueryFactory(pathToExcelFile);
    var artistAlbums = from a in excelFile.Worksheet(sheetName) select a;
  
    foreach (var a in artistAlbums)
    {
        string artistInfo = "Artist Name: {0}; Album: {1}";
        Console.WriteLine(string.Format(artistInfo, a["Name"], a["Title"]));
    }
}

That’s it. Initialize an ExcelQueryFactory object, and use Linq the way you are accustomed to.

In the previous example, we are accepting what LinqToExcel sends us by default, which is an ExcelQueryable<LinqToExcel.Row> object (which implements IQueryable and other useful interfaces). Each Row object is composed of (take a guess . . .) LinqToExcel.Cell objects.

Materialize Data into an Object Using LinqToExcel

But it gets better. LinqToExcel will also materialize our data to our objects, if the column names match:

LinqToExcel Query Materializing a Class:
public class ArtistAlbum
{
    public string Name { get; set; }
    public string Title { get; set; }
}
  
void PrintArtistAlbums()
{
    string pathToExcelFile = ""
        + @"D:\Code\Blog Projects\BlogSandbox\ArtistAlbums.xlsx";
  
    string sheetName = "Sheet1";
    var excelFile = new ExcelQueryFactory(pathToExcelFile);
    var artistAlbums = from a in excelFile.Worksheet<ArtistAlbum>(sheetName) select a;
  
    foreach (var a in artistAlbums)
    {
        string artistInfo = "Artist Name: {0}; Album: {1}";
        Console.WriteLine(artistInfo, a.Name, a.Title);
    }
}

Note the items called out in red above. In this case, we have told LinqToExcel that the objects it expects to get back from the query should map to the ArtistAlbum class. If your column names match, you’re all set – pull your data in and populate the class like you would with any other Linq query.

If your columns DON’T match, you still have some options. Let’s say our boss gives us a spreadsheet each week to Do Something Important with in our application. He is also fond of crazy column names. We can take advantage of the AddMapping method provided by the ExcelQueryFactory object to map spreadsheet columns to the right property on our class.

Let’s say that our ArtistAlbum class is defined as before, but the boss insists on naming the “Name” column on HIS spreadsheet as “Artist Name,” and the “Title” column as “Album Title”.  In this case, we can do this:

Use the AddMapping Method to Map Column Names to Property Names:
void PrintArtistAlbums()
{
    string pathToExcelFile = ""
        + @"D:\Code\Blog Projects\BlogSandbox\ArtistAlbums.xlsx";
  
    string sheetName = "Sheet1";
    var excelFile = new ExcelQueryFactory(pathToExcelFile);
  
    // ADD COLUMN MAPPINGS:
    excelFile.AddMapping("Name", "Artist Name");
    excelFile.AddMapping("Title", "Album Title");
  
    var artistAlbums = from a in excelFile.Worksheet<ArtistAlbum>(sheetName) select a;
  
    foreach (var a in artistAlbums)
    {
        string artistInfo = "Artist Name: {0}; Album: {1}";
        Console.WriteLine(artistInfo, a.Name, a.Title);
    }
}

Use LinqToExcel to Query Workbook MetaData

We can use the GetWorksheetNames method to query Excel and return an IEnumerable<string> containing the names of all the Worksheets within a Workbook:

Retrieve the Worksheet Names Contained within an Excel Workbook:
public IEnumerable<string> getWorkSheets()
{
    string pathToExcelFile = ""
    + @"D:\Code\Blog Projects\BlogSandbox\ArtistAlbums.xlsx";
    
    // This is an expensive operation - Do this only once if possible:
    var excelFile = new ExcelQueryFactory(pathToExcelFile);
    var worksheetsList = excelFile.GetWorksheetNames();
    return excelFile.GetWorksheetNames();
}

Query Column Names on an Excel Worksheet

Often, it’s handy to be able to grab the column names available within a particular worksheet. That’s also easy:

var columnNames = _excel.GetColumnNames(selectedSheetName);

This returns a simple IEnumerable<string> containing the column names in the Worksheet.

Using The LinqToExcel GetColumnNames Method with Reflection – Example

This might be useful as a way to check and see if the properties exposed by one of our classes are all represented by columns in a spreadsheet. We can use reflection to walk the public properties exposed by a class instance, then compare the property names with our column names (case-insensitively, if we choose). This is a really basic example, but you get the idea:

Using the GetColumns Method to Validate Worksheet Columns Against Object Properties:
public IEnumerable<string> ValidateSheetColumns(string selectedSheetName)
{
    string pathToExcelFile = ""
    + @"D:\Code\Blog Projects\BlogSandbox\ArtistAlbums.xlsx";
  
    // This is an expensive operation - Do this only once if possible:
    var excelFile = new ExcelQueryFactory(pathToExcelFile);
  
    // Grab the column names from the Worksheet:
    var SheetColumnNames = excelFile.GetColumnNames(selectedSheetName);
  
    // Grab a list of the properties exposed by our class instance:
    var prototype = new ArtistAlbum();
    var ClassProperties = prototype.GetType().GetProperties();
  
    // Create an enumerable containing the property names:
    var requiredColumns = (from p in ClassProperties select p.Name);
  
    // Copy the names into our missing columns list:
    var missingColumns = requiredColumns.ToList();
    foreach (string column in requiredColumns)
    {
        foreach (var candidate in SheetColumnNames)
        {
            if (column.Equals(candidate, StringComparison.InvariantCultureIgnoreCase))
            {
                // Any items left represent properties which were
                // not found to match in the column names (case-insensitive):
                missingColumns.Remove(column);
            }
        }
    }
    return missingColumns;
}

Of course, the above only works well with simple POCO classes. If you have read-only properties on your class, you might run into problems. But for the sake of example, you get the point.

Note about Performance

Queries against Excel using LinqToExcel are nice and fast. However, it is important to remember that the initialization of an instance of ExcelQueryFactory is relatively expensive (you are essentially initiating a database connection from a cold start, then pulling a large amount of XML data into memory). This should be considered in designing your application’s use of this library.

Where possible, I recommend doing your best to initialize the ExcelQueryFactory as few times as possible within the context of a given operation. My examples above include initialization in each example so that they can stand on their own. Most likely, though, in a real-world context I would try to make the ExcelQueryFactory instance a class member variable where it made sense.

Wrapping Up

LinqToExcel is a cool enough tool that I decided to write this quick article. While I personally avoid working with Excel data wherever possible, (Excel is for analysis of properly persisted data, dammit!), there is often no getting around it. While there are certainly other ways to work with Excel data in your application, LinqToExcel makes this about as simple as it can get.

Additional Resources and Items of Interest

C#
Understanding Windows Azure Blob Storage (A Quick Overview)
Biggy
Biggy: Interface Semantics and The Interface Segregation Principle
ASP.Net
ASP.NET Web Api: Understanding OWIN/Katana Authentication/Authorization Part II: Models and Persistence
  • jatten

    jattenjatten

    Author Reply

    @Ushitha – Yeah, I was checking that out tonight.
    @Eric – While LinqToExcel does NOT require interop, it DOES require that the ACE database engine be installed. Which absolutely borked my deployment to Azure (ACE is not available on an Azure Website – gotta spin up a worker, or use a VM, neither of which is an option in this case).

    For most use cases, the ACE redistributable would probably work, but you do need to be able to install it, so far as I know to this point (in other words, it is not a dll, but an exe).

    But, indeed, the LinqToExcel library is pretty handy for those hateful moments . . .

    Both, also check out Excel Data Reader (it's on CodePlex, google it – I can;t put a damn link in here). Not quite as smooth as LinqToExcel, but it works against the Excel Binary file and uses OpenXml. No ACE, no Interop. The API is not as cool, though – returns datables (ugh).


  • Uchitha

    UchithaUchitha

    Author Reply

    I have used a library called EppPlus with similar success. No interop requirements, Linq support, Open Source etc…


  • Erik Dietrich

    Wow, nice. Definitely adding this to my arsenal for those unfortunate times when I have to touch Excel.


  • jatten

    jattenjatten

    Author Reply

    @Eric –

    It takes NO DEPENDENCY on interop, which is precisely why I used it!

    Far as I can tell from skimming the source, it requires the ACE Database engine. I infer from this that there is some OleDb or ODBC happening in the background somewhere. But is sure makes getting at the Excel data easy . . .

    Even if it does nothing but abstract the OleDb data access, it is still a win.


  • Erik Dietrich

    Do you know if this library takes a dependency on the Office Interop libraries, or does it parse the actual file bytes of the Excel files?