LINQ is cool, but why?

Intro

I’ve gotten to use LINQ a fair bit lately. Not to query a database, but on in-memory lists with LINQ to Objects. Here’s a little test I did using the Aggregate extension method.

Just a small clarification

When I talk about LINQ here, I’m talking about LINQ to Objects, with queries performed on IEnumerable or IEnumerable<T> collections. I’m not using query syntax, but method syntax with a heavy dose of lambda expressions.
Side note: LINQ is not the same thing as lambdas. LINQ, or Language-Integrated Query, provides a set of extension methods which can be found in the System.Linq namespace. A lot of these methods take a Func delegate in some form or another, which is easily written as a lambda expression. You’ll also hear the term ‘predicate’ when using LINQ. That’s just a special kind of Func that always returns a bool (usually Func<T, bool>).
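To make that concrete, here’s a minimal sketch (the names are my own, not from the article) of method syntax with a lambda used as a predicate:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class PredicateDemo
{
    static void Main()
    {
        var numbers = new List<int> { 1, 2, 3, 4, 5, 6 };

        // A predicate: a Func<int, bool> that decides which items pass the filter.
        Func<int, bool> isEven = n => n % 2 == 0;

        // Method syntax with a lambda; no query syntax in sight.
        var evens = numbers.Where(isEven).ToList();

        Console.WriteLine(string.Join(",", evens)); // prints 2,4,6
    }
}
```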

An example to work with

Imagine you have a text file (.csv) that you need to parse, comparing each line with the next one. Say the file contains daily data that you store as a set of time series in a database, where each value is valid from a start date until an end date. The data should be continuous (no gaps), but the text file contains only start dates and values: the start date of the next record is the end date of the current one. It looks something like this:

2018-10-22 00:00,10.53
2018-10-23 00:00,9.5
2018-10-25 00:00,10.51
2018-10-30 00:00,9

You know it’s daily data, but you can’t simply read each line and set endDate = startDate + 1 day, because a line is only written when the value changes. In the example above, the value on 2018-10-24 is still 9.5. So you don’t know the endDate until you reach the next line.

You can loop over each line, storing the date and value in a temporary ‘previous values’ variable, and create your ranged value after the fact. The code is fast, but the readability factor is low.

Demo time

If you want to follow along, create a Console App and add a small class to hold ranged values. We’ll create a text file with 2,000,000 lines and then parse it into our list of ranged values. We’ll use a Stopwatch to get an indication of performance. You can paste all the example code just after the Stopwatch initialization.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Globalization;
using System.IO;
using System.Linq; //no LINQ without this

namespace LinqAggregate
{
  class Program
  {
    class RangedValue
    {
      public DateTime Start { get; }
      public DateTime End { get; }
      public double Value { get; }

      public RangedValue(DateTime start, DateTime end, double value)
      {
        Start = start;
        End = end;
        Value = value;
      }
    }

    static void Main(string[] args)
    {
      var filePath = @"C:\Temp\Aggregate.txt";
      var result = new List<RangedValue>();
      CreateFile(filePath);

      if (File.Exists(filePath))
      {
        var lines = File.ReadLines(filePath);
        Stopwatch sw = new Stopwatch();	  
        //TODO: paste examples here
      }
      Console.ReadLine();
    }

    static void CreateFile(string filePath)
    {
      using (var file = new StreamWriter(filePath))
      {
        for (int i = 0; i < 2000000; i++)
        {
          var date = DateTime.Now.Date.AddDays(i);
          if (date.DayOfWeek == DayOfWeek.Saturday || date.DayOfWeek == DayOfWeek.Sunday)
            continue; //this is not a weekend job
          //use the invariant culture so the decimal separator is always '.', matching the parse below
          file.WriteLine(string.Format(CultureInfo.InvariantCulture, "{0:yyyy-MM-dd},{1}", date, i * 0.02));
        }
      }
    }
  }
}
 
Let’s try the ‘old’ way first. We’ll loop over the lines, storing the previous values and adding to the ranged list.
#region _- No LINQ -_

sw.Start();
var arrLines = lines.ToArray();
Tuple<DateTime, double> previousLineValues = null;
for (int i = 0; i < arrLines.Length; i++)
{
  var fields = arrLines[i].Split(',');
  var dateTime = DateTime.ParseExact(fields[0], "yyyy-MM-dd", CultureInfo.InvariantCulture, DateTimeStyles.AssumeLocal);
  var value = double.Parse(fields[1], CultureInfo.InvariantCulture);
  var currentLineValues = new Tuple<DateTime, double>(dateTime, value);
  if (previousLineValues == null)
    previousLineValues = currentLineValues;
  if (currentLineValues.Item1 > previousLineValues.Item1)
  {
    result.Add(new RangedValue(previousLineValues.Item1, currentLineValues.Item1, previousLineValues.Item2));
    previousLineValues = currentLineValues;
  }
  if (i == arrLines.Length - 1)
    result.Add(new RangedValue(currentLineValues.Item1, currentLineValues.Item1.AddDays(1), currentLineValues.Item2));
}

sw.Stop();
Console.WriteLine($"Not using LINQ: {sw.ElapsedMilliseconds} ms");

#endregion _- No LINQ -_

On my machine I’m getting an average time of 2350 ms, which is pretty fast for parsing 2,000,000 lines. The readability, on the other hand, leaves a lot to be desired. We have a for loop and 3 if statements, which means too much nesting and indentation. I pity the next developer who has to maintain this code.

Remark: At first, I was thinking we were violating the single responsibility principle. But that’s about class design, not method design. I’ll get into that in another blog post, keep an eye out for that 😉.

Let’s see what we can do if we introduce LINQ. There’s a LINQ extension method that will do exactly what we need: the Aggregate method. More particularly, the simplest of the 3 available overloads, Aggregate<TSource>(IEnumerable<TSource>, Func<TSource,TSource,TSource>). Simple? Well yes, once you get to know it 😉. According to the definition on Microsoft Docs, the Aggregate method “applies an accumulator function over a sequence” (+1 for simplicity 🙄). In other words, it iterates (loops) over a sequence (IEnumerable<T>), applying a Func<T, T, T> that takes the current and next value and uses the return value as the current value in the next iteration.
An example seems in order here. Say we do an addition on a list of numbers: 1 + 2 + 3 + 4 + 5.

var list = new List<int>() { 1, 2, 3, 4, 5 };
var x = list.Aggregate((c, n) => {
  var sum = c + n;
  Console.WriteLine($"c = {c}, n = {n}, return {sum}");
  return sum;
});

Output:
c = 1, n = 2, return 3
c = 3, n = 3, return 6
c = 6, n = 4, return 10
c = 10, n = 5, return 15

Note: if you only need the final result, the lambda can of course be simplified as

var x = list.Aggregate((c, n) => c + n);

What’s going on?
The Aggregate method will loop 4 times: it seeds the accumulator with the first item, then applies the Func to each following item in the list (that’s why the first item is never passed as ‘n’; read the remarks on Microsoft Docs).
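To make that seeding behavior concrete, here’s a simplified sketch of what the seedless Aggregate overload does internally. This is my own illustration of the idea, not the actual System.Linq source:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class AggregateSketch
{
    //simplified re-implementation of the seedless Aggregate overload
    public static TSource MyAggregate<TSource>(
        IEnumerable<TSource> source, Func<TSource, TSource, TSource> func)
    {
        using (var e = source.GetEnumerator())
        {
            if (!e.MoveNext())
                throw new InvalidOperationException("Sequence contains no elements");

            var result = e.Current;               //the first item seeds the accumulator
            while (e.MoveNext())
                result = func(result, e.Current); //accumulator + next item
            return result;
        }
    }

    static void Main()
    {
        var list = new List<int> { 1, 2, 3, 4, 5 };
        Console.WriteLine(MyAggregate(list, (c, n) => c + n)); // prints 15
        Console.WriteLine(list.Aggregate((c, n) => c + n));    // prints 15 as well
    }
}
```

Note that with 5 items the while loop runs 4 times, which matches the 4 lines of output above.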

But in our CSV example we’re not adding anything up, so how is this helpful? The Func takes in the current and next value, and its return value becomes the current value in the next iteration, so we simply return next.

#region _- LINQ -_
                
result = new List<RangedValue>();
sw.Reset();
sw.Start();
var lastLine = lines.ToArray().Aggregate((curr, next) =>
{
  var currValues = curr.Split(',');
  var nextValues = next.Split(',');
  var start = DateTime.ParseExact(currValues[0], "yyyy-MM-dd", CultureInfo.InvariantCulture, DateTimeStyles.AssumeLocal);
  var end = DateTime.ParseExact(nextValues[0], "yyyy-MM-dd", CultureInfo.InvariantCulture, DateTimeStyles.AssumeLocal);
  var value = double.Parse(currValues[1], CultureInfo.InvariantCulture);
  result.Add(new RangedValue(start, end, value));
  return next;
});
//Aggregate returns the last line, which still needs its own 1-day range, like in the non-LINQ version
var lastValues = lastLine.Split(',');
var lastStart = DateTime.ParseExact(lastValues[0], "yyyy-MM-dd", CultureInfo.InvariantCulture, DateTimeStyles.AssumeLocal);
result.Add(new RangedValue(lastStart, lastStart.AddDays(1), double.Parse(lastValues[1], CultureInfo.InvariantCulture)));
sw.Stop();
Console.WriteLine($"Using LINQ: {sw.ElapsedMilliseconds} ms");

#endregion _- LINQ -_

That looks like less code, but we’re losing performance: I’m getting an average of 3550 ms. Probably all the parsing we’re doing on both the current and the next line is slowing us down. What if we create a temporary list that holds the parsed values, and aggregate over that?

#region _- 2-stepped -_

result = new List<RangedValue>();
var temp = new List<Tuple<DateTime, double>>();
sw.Reset();
sw.Start();
foreach (var line in lines)
{
  var fields = line.Split(',');
  var dateTime = DateTime.ParseExact(fields[0], "yyyy-MM-dd", CultureInfo.InvariantCulture, DateTimeStyles.AssumeLocal);
  var value = double.Parse(fields[1], CultureInfo.InvariantCulture);
  temp.Add(new Tuple<DateTime, double>(dateTime, value));
}
var lastPoint = temp.Aggregate((curr, next) =>
{
  result.Add(new RangedValue(curr.Item1, next.Item1, curr.Item2));
  return next;
});
//close off the final record with a 1-day range, like in the non-LINQ version
result.Add(new RangedValue(lastPoint.Item1, lastPoint.Item1.AddDays(1), lastPoint.Item2));
sw.Stop();
Console.WriteLine($"Using 2-step LINQ: {sw.ElapsedMilliseconds} ms");

#endregion _- 2-stepped -_

Well, our Aggregate implementation is simplified, and our performance is back to what it was. Great, done, right? Strictly speaking yes, but it still feels too clunky. Let’s refactor that foreach into a separate method and introduce the yield keyword.

static IEnumerable<Tuple<DateTime, double>> ParseLines(IEnumerable<string> lines)
{
  foreach (var line in lines)
  {
    var fields = line.Split(',');
    var dateTime = DateTime.ParseExact(fields[0], "yyyy-MM-dd", CultureInfo.InvariantCulture, DateTimeStyles.AssumeLocal);
    var value = double.Parse(fields[1], CultureInfo.InvariantCulture);
    yield return new Tuple<DateTime, double>(dateTime, value);
  }
}

With the yield keyword we parse one line at a time, returning each result immediately instead of building the whole temporary list up front.
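A quick standalone demo (my own, not part of the parser) makes that lazy behavior visible: nothing in the iterator body runs until the sequence is enumerated, and each item is produced on demand.

```csharp
using System;
using System.Collections.Generic;

class YieldDemo
{
    static IEnumerable<int> Numbers()
    {
        for (int i = 1; i <= 3; i++)
        {
            Console.WriteLine($"yielding {i}"); //runs only when an item is requested
            yield return i;
        }
    }

    static void Main()
    {
        var seq = Numbers(); //nothing printed yet: the method body hasn't started
        foreach (var n in seq)
            Console.WriteLine($"consuming {n}");
        // Output interleaves: yielding 1, consuming 1, yielding 2, consuming 2, ...
    }
}
```

The same mechanism is what lets ParseLines feed Aggregate one parsed tuple at a time instead of materializing the whole temp list first.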