Processing Data with Kotlin Dataframe Preview

Kotlin DataFrame, now available as a first public preview, is a new library for processing data from a variety of sources, such as CSV, JSON, Excel, and Apache Arrow files. The library works together with Kotlin data classes and hierarchical data schemas through a domain-specific language (DSL).
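
To experiment with the preview, the library is added as a regular dependency. The following Gradle snippet is a minimal sketch: the coordinates follow the Kotlin DataFrame GitHub repository, and the version placeholder should be replaced with the current preview release.

// build.gradle.kts: a minimal dependency sketch (version is a placeholder)
dependencies {
    implementation("org.jetbrains.kotlinx:dataframe:<preview-version>")
}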

A data frame is a two-dimensional table with labeled columns, comparable to a spreadsheet, an SQL table, or a CSV file. The Kotlin DataFrame GitHub repository contains more information about the set of operations on the data, available through a DSL, that supports data analysis. The library started as a wrapper around the Krangl library, but most of the functionality has been rewritten over time.

The Kotlin DataFrame library was designed to read and display data from any source, and it allows nesting of columns and cells; any Kotlin object or collection can be stored in and retrieved from a DataFrame.
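
As a brief sketch of that capability, assuming the columnOf and dataFrameOf builders introduced below and a hypothetical Grade class, a column may hold collections of arbitrary Kotlin objects:

// A hypothetical class stored inside DataFrame cells
data class Grade(val subject: String, val score: Int)

// Each cell of the grades column holds a list of Grade objects
val name by columnOf("Akmad", "Serik")
val grades by columnOf(
    listOf(Grade("Math", 8), Grade("History", 7)),
    listOf(Grade("Math", 9))
)
println(dataFrameOf(name, grades))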

Data may be supplied via a file such as students.csv:

firstName, lastName, country
Akmad, Desiree, The Netherlands
Serik, Chuy, India
Ioel, Jan, Belgium
Draco, Arti, Argentina
Myrna, Hyginos, Bolivia
Dalila, Dardanos, Belgium

The DataFrame is created based on the contents of the file:

val studentsDataFrame = DataFrame.read("students.csv")
print(studentsDataFrame.head())

By default, the head() method returns the first five rows:

  firstName  lastName          country
0     Akmad   Desiree  The Netherlands
1     Serik      Chuy            India
2      Ioel       Jan          Belgium
3     Draco      Arti        Argentina
4     Myrna   Hyginos          Bolivia

Alternatively, the DataFrame may be created programmatically:

val firstName by columnOf("Akmad", "Serik", "Ioel", "Draco", "Myrna", "Dalila")
val lastName by columnOf("Desiree", "Chuy", "Jan", "Arti", "Hyginos", "Dardanos")
val country by columnOf("The Netherlands", "India", "Belgium", "Argentina", "Bolivia", "Belgium")

By supplying an argument to the head() method, the number of rows may be specified, in this case two:

val customDataFrame = dataFrameOf(firstName, lastName, country)
print(customDataFrame.head(2))
  firstName  lastName          country
0     Akmad   Desiree  The Netherlands
1     Serik      Chuy            India

The API offers various ways to retrieve specific parts of the data, such as the contents of the first row:

println(studentsDataFrame.get(0).values())
[Akmad, Desiree, The Netherlands]

Alternatively, a specific column, such as the country column, may be retrieved:

println(studentsDataFrame.getColumnOrNull(2)?.values())
[The Netherlands, India, Belgium, Argentina, Bolivia, Belgium]
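
Columns may also be addressed by name rather than by index; a minimal sketch, assuming the DSL's indexing operator:

// Retrieve the country column by name instead of position
println(studentsDataFrame["country"].values())
[The Netherlands, India, Belgium, Argentina, Bolivia, Belgium]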

More advanced API methods allow, for example, sorting and removing elements from a DataFrame:

println(studentsDataFrame.sortBy("firstName").remove("country"))
  firstName  lastName
0     Akmad   Desiree
1    Dalila  Dardanos
2     Draco      Arti
3      Ioel       Jan
4     Myrna   Hyginos
5     Serik      Chuy
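
The DSL also covers aggregations. As a minimal sketch, assuming the groupBy and count operations, the students may be counted per country:

// Group the rows by country and count the students in each group;
// Belgium appears twice in the data, every other country once
println(studentsDataFrame.groupBy("country").count())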

A @DataSchema annotation may be specified to improve data handling:

@DataSchema
interface Student {
    val firstName: String
    val lastName: String
    val country: String
}

Now the schema can be used together with the DataFrame API to filter on the country field and sort on the firstName field of Student:

val studentsDataFrame = DataFrame.read("students.csv")

println(studentsDataFrame.filter { it[Student::country] != "Belgium" }.sortBy(Student::firstName))
  firstName  lastName          country
0     Akmad   Desiree  The Netherlands
1     Draco      Arti        Argentina
2     Myrna   Hyginos          Bolivia
3     Serik      Chuy            India
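
A frame can also be converted back into plain Kotlin objects. The following is a minimal sketch, assuming a hypothetical StudentRecord data class whose properties match the column names, using the library's toListOf conversion:

// A hypothetical data class matching the CSV columns by name
data class StudentRecord(
    val firstName: String,
    val lastName: String,
    val country: String
)

// Map each row onto a StudentRecord instance
val students = studentsDataFrame.toListOf<StudentRecord>()
println(students.first().country) // The Netherlands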

More information can be found in the first video about Kotlin DataFrame, which covers the basic operations and the processing of tables. Various examples are available, and the #datascience channel on the Kotlin Slack may be used to ask questions after signing up.

Community comments

  • Spark as an ETL or Kotlin Frames?

    by Ricardo Legorreta,

    We use Kotlin a lot for Spring microservices, Neo4j, Spring Data, and Spring DSL. We are developing a microservice to do all import and data validation operations using Spark (in Scala). Spark is by far too big for developing just an ETL.
    What comments do you have on using Spark just as an ETL versus using Kotlin DataFrames?

    Ricardo

  • Re: Spark as an ETL or Kotlin Frames?

    by Johan Janssen,

    Interesting question, and as is often the case the answer is 'it depends' :).

    From my experience, it's nice for developers to use all kinds of languages. However, for the organization it's quite a risk in the long run. Maintaining and supporting multiple languages at the same level is a challenge, and it might become hard to find developers for a specific language. I know, for instance, that quite a few companies struggle to find Scala developers for their (older) applications. From that point of view I would say it's good to use one language for everything the team is doing, which in this case might be Kotlin.

    If you don't need the performance and scalability of Spark, then it might be good to use Kotlin. However, I have to say that I quite liked Spark, especially with large datasets, such as when reading large files. So it also depends a bit on what kind of data you're processing.

  • Re: Spark as an ETL or Kotlin Frames?

    by Ricardo Legorreta,

    Johan, great comments,

    Since my environment is not “big data”, we are talking about hundreds or thousands of nodes at most. But your point about needing Scala developers is the stronger one (I had never thought of it). I have known Scala for many years and I am still far from being a Scala expert; what's more, when I read a two-year-old Scala class that I programmed, it takes me time to understand what I did (e.g., the use of for intrinsics inside my classes). Scala is very powerful but also sometimes very difficult for team developers to understand.

    I have one more week before the project starts, but given your two comments it seems worth trying Kotlin DataFrames first rather than Spark.

    One difficult goal is how I plan to implement “validations”; they have to be dynamic. That means the “power user” can add or modify validations without recompiling the whole ingestor microservice. So far the best solution I have is to implement the validations in Python: the “power user” can edit the Python file, and by simply touching the file the validations can be used again (even inside Docker). I call this concept User Defined Validations (“UDV”). Since Kotlin is compiled, I don't see another solution, other than using a Jupyter notebook. The problem is that the editor for the UDV has to be inside the UI of my microservice and not a separate app like a Jupyter notebook.

    Again thanks for your help.
