Mathias Brandewinder on F# for Data Science

1. We are here at BuildStuff in Lithuania. I am sitting here with Mathias. Mathias, who are you?

Hi. I am Mathias Brandewinder. The short version would be that I am a French person with a bit of German in me, who lives in San Francisco. People also know me as a huge fan of F#. That is why I am here – to give some talks at Build Stuff on F#. I guess that is me.

   

2. Why F#?

That's a good question, which I get asked a lot. I came to F# completely by accident. I had read that you should learn a new language every year. That was in 2010. So I opened the books, saw there was a new language, started playing with it and just fell in love with it. Many people use F# for completely different things, so it is hard to give one answer: F# is good at lots of different things, just like C# is, and it is just as hard to answer “Why C#?”; it depends on what you do. One of the reasons I like it a lot is that one of my areas of interest is machine learning. My background before being a software engineer was actually operations research, applied math, economics and all of this. From my standpoint, object-oriented languages are just not very good for representing that type of problem, and F# has been just fantastic for that type of work.

So, one of the main reasons I love it is that it just works for that type of job. There are many reasons why that would be the case, but one of them, which is a given in other languages, is that F# has a scripting environment with a REPL. If you have not used a REPL, that may not sound like a big deal, but the moment you start using it, it is really hard to go back, especially when you work with data. With a REPL I can say “Hey, load the data” and then start hacking at it, and that is really nice. That makes a huge difference, because without a REPL I would load the data, hack at it, change the code, rebuild, reload the data, hack again. So it is the difference between wasting half of my day loading data and spending my entire day working with data.

The other thing – I have this tendency of not stopping when I start talking about F# – the other reason I think F# is a very interesting language for this purpose is that people tend to like Python a lot for machine learning and data science, and one of the reasons is that, in general, static types get a little in the way when you try to work with data. Life is all good when you are writing your own code: you create your classes, your types, and you are a happy camper. The problem is that the other half of your life is dealing with data which is not in your type system. Maybe you have to pull from a JSON service, maybe a SQL server, maybe an XML document, whatever, and all these things are not in your types. So, if you are in a statically typed language, typically you have to spend an inordinate amount of time just transforming data which is not in your world into data which is in your world, which is your types. If you take a language which is dynamic, like Python, you do not really have that problem. You can just start, because anything can be anything.

You just code things and if it works, it works. So you can start hacking faster. Of course, there are two drawbacks. The first one is that the compiler is not really going to help you much. So, what you gain in speed or in hackability, you pay for in runtime errors. The only way to see if it works is by running it and seeing your code explode if you made a mistake. The other problem, which is maybe not completely general, is that dynamic languages tend to be a bit slow. So, they are really nice for exploration, but then there is the moment when you want to run against a real dataset, on a real system in production, and that is not going to be that great. This is where you would really want something which has both the beauty and all the strength of a statically typed language and the ease of access to data of a dynamic language. So F# has this phenomenal feature, type providers, which, as far as I know, only exists somewhat in one other language, which is Idris. If Haskell is too mainstream for you, Idris is where you would probably go next.

The type provider is like a compiler extension which is going to look at a schema and essentially, magically create types for you on the fly. So, this is really neat because now, if I use F# I can actually, say, target a JSON service, I just have to say “Hey. Here is the URL” and magically you have types and you can start working with it and discover data. It is statically typed so I do not have the risk of typos, I do not have runtime errors, so I love this. Then the other side is that because it is really a good, solid, .NET language, I can explore and then, with very little work, I can ship it in production and then it will just run like a champion. So, these are some of the reasons why I love F# for data science and machine learning.
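A rough way to picture “look at a schema and create types on the fly” is the sketch below. It is written in Python rather than F#, and Python has no type providers, so it only mimics the idea at runtime: a record type is derived from a sample document instead of being written by hand. The names `SAMPLE`, `type_from_sample` and `Person`, and both JSON payloads, are hypothetical and made up for illustration.

```python
import json
from dataclasses import fields, make_dataclass

# Hypothetical sample document standing in for what a JSON service returns.
SAMPLE = '{"name": "Ada", "age": 36}'


def type_from_sample(type_name, sample_json):
    """Build a record class whose fields mirror the keys of a sample object."""
    sample = json.loads(sample_json)
    # Each key becomes a field; its type is taken from the sample value.
    inferred = [(key, type(value)) for key, value in sample.items()]
    return make_dataclass(type_name, inferred)


# The type is derived from the sample, not declared by hand.
Person = type_from_sample("Person", SAMPLE)
field_names = [f.name for f in fields(Person)]

# Another (made-up) payload with the same shape loads into that type.
person = Person(**json.loads('{"name": "Grace", "age": 45}'))
print(field_names, person.name, person.age)
```

The analogy stops at the point the interview cares about most: with a real type provider a misspelled property such as `person.nmae` would be rejected by the compiler, whereas in this sketch it only fails with an `AttributeError` when that line actually runs.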

   

3. You mentioned that F# was more suited for data processing and machine learning. Are object-oriented languages worse at that? What makes F# better? Is it the data structures, or what is it?

Leaving the tooling aside, because you could really have a REPL in C# (scriptcs is doing just that), I think the first half is the fact that the F# syntax is concise in a good way. Not concise in the Perl way, where it is concise but it looks like your cat fell asleep on your keyboard. That is just convenient for a scripting type of scenario. Objects are great and they have their place, but if I think about the prototypical machine learning problem, it is to take data and transform it until I get it in the shape I want. Once I have it in that shape, I will probably apply an algorithm, and that algorithm is going to try to fit something to the data, like a model, and it will run and run until the fit is good. Both of these things are really more functional than object-oriented. The first part, in which I take data and move it towards the shape I want, is perfect for pipelining, like sequence mapping, where I take data and I map it, and I map it, and I map it, and I can just apply functions.

And this is where objects are just not really your friends, because I do not want to create objects at every step of my pipeline. What each step really is, is a function: I take data, I apply a function to it and I get something else. So, that is one side. The second side is about having an algorithm which is just going to keep working on the data until it converges. That may be my bias, but my experience has been that it fits really, really well with a functional style as well. You are going to have a map, a fold, a reduce. The word MapReduce might sound familiar, and it actually comes from a functional concept, which is another indication that the two worlds work really, really well together. It is not that there is anything wrong with OO, but in that context you do not really have an object owning the function. What you really want is to have data of whatever shape and apply functions to it until you have what you want, and that is really where a functional language shines.
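The two halves described in this answer, shaping data through a chain of functions and then running an algorithm until the fit is good, can be sketched as follows. This is a minimal illustration in Python rather than F#, under stated assumptions: the readings in `raw` are made up, the "model" is just a single number fitted to the data, and `fit_mean` with its `rate` and `tolerance` parameters is a hypothetical helper, not something from the interview.

```python
from functools import reduce

# Hypothetical raw readings, as they might arrive from a text file.
raw = ["1.0", "2.0", "", "3.0", "4.0"]

# Part 1: move the data towards the shape we want by chaining functions.
# No classes are created along the way; each step is data in, data out.
numbers = list(map(float, filter(None, raw)))      # drop blanks, parse floats
mean = sum(numbers) / len(numbers)
centered = list(map(lambda x: x - mean, numbers))  # shift so the mean is zero

# A fold (reduce) collapses the whole sequence into a single value.
sum_of_squares = reduce(lambda acc, x: acc + x * x, centered, 0.0)


# Part 2: run an algorithm over the data until it converges.
def fit_mean(values, rate=0.1, tolerance=1e-9):
    """Fit one number to the data by gradient descent on the squared error."""
    estimate = 0.0
    while True:
        # Average error of the current estimate against every data point.
        gradient = sum(estimate - v for v in values) / len(values)
        updated = estimate - rate * gradient
        if abs(updated - estimate) < tolerance:    # the fit stopped improving
            return updated
        estimate = updated


estimate = fit_mean(numbers)
print(centered, sum_of_squares, round(estimate, 6))
```

Every step is a function applied to data, and none of the functions is owned by an object, which is the point being made about where a functional style fits.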

   

4. You mentioned machine learning and you already mentioned these exploratory advantages in F# with the REPL. Are there any other advantages that F# brings to machine learning? Any special tooling out there? Anything else?

Well, I mentioned the type providers, which are definitely one piece which is important and makes F# pretty unique. Another piece is that the community has been building a lot of libraries. That is an aside again, but one of the aspects which is truly exciting for me in F#, beyond the language and all of this, is the community, which is absolutely awesome. People are just friendly, they are fun and they just build the tools from scratch. It is one of these places where you say “Hey. I wish somebody had done this tool” and then they say “Yes, why don’t you do it?”, and people just do it. So, the community, as a whole, has been building more and more tools in this direction.

So, type providers are a general mechanism which you can use in F#, and there is this project called “F# Data” which contains a collection of type providers contributed by the community. You have type providers for CSV, JSON and XML, and one for HTML which enables you to scrape data, so whatever data source you are targeting, it is probably there. That is already a great direction, and then other people started building tools as well. The obvious scenario for type providers is to access data easily, which is, realistically, what 80% of your machine learning is. It is not the sexy part, it is not what people talk about, but that is really what you do, and functional languages are good for that. But you can use type providers to do other things, which is kind of interesting as well. If a type provider is going to look at a schema and create types for me on the fly, I can target data, but I could also target other things, like a language. And so, there is this one guy who thought “Why not try it?” and built a type provider for R. R is this language designed by statisticians for statisticians.

The joke goes like “If it is designed by statisticians as a language, then maybe it is not the best language you could imagine”, but for statisticians, it pretty much does what you could want to do with statistics; it has packages for everything. Now, with the type provider, I can be in F# and target any package which has ever been written in R. That is really nice because now I get all the power of F# to access data, I get all the mighty power of the .NET framework, and I can use everything I have in R to complement what I do. So F# becomes this fantastic glue language where you can put things together and get things done.

Werner: I think you have given us a lot to look at. I think type providers are a big topic in F# and we will check it out. Thank you, Mathias.

If people have any questions or curiosity about F#, they should absolutely check fsharp.org, which is the home of the F# community, and there are plenty of resources there on machine learning and other topics.

Werner: You heard it, audience; we will all check it out. Thank you very much.

Thank you.

Mar 02, 2015
