Rhino - ETL

Using Rhino-ETL ( a C# based) framework for developing standard ETL's is pretty easy and one can do a lot of fun stuff with the underlying source data. I just wrote up a quick console app to generate data into a text file and push the same data into a table in SQL Server as well as an external file. Here are the following steps (for the external file push):
1. Create a new C# console application solution in Visual Studio.
2. Target the .Net framework as shown in the below screen shot in your project properties:-

3. Create 3 Sub Folders underneath your project as shown in the following screen shot
DataObjects --> Contains the class files associated with each and every table/file in your environment. Example:- if your source file contains student data, then you would create a class file called Student with the individual properties (in relation to the properties) exposed (nouns).
Operations--> This primarily contains the class files that contain the activities (adjectives) that need to be performed on the DataObjects. Example:- Writing the Student Data to a database, Reading the Student data from a file etc.
WorkFolder--> Contains the external file sources to interact with. Example:- a flat file, a csv or a tsv. In this case it will be student.txt.


Lets write some code to insert a student record from a flat file into another flat file.......(as simple as it sounds)

4. Create a class file called StudentRecord.cs (pipe delimited) and declare the required entity attributes as shown in the following code snippet:-
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using FileHelpers;
namespace FirstSampleRhinoETL.DataObjects
{
[DelimitedRecord("|"), IgnoreCommentedLines("Commenting Lines"), IgnoreFirst]
class StudentRecord
{
int id;
string name;
string address;
int classid;
int marks;
public int StudentId
{
get { return id; }
set { id = value; }
}
public string StudentName
{
get { return name; }
set { name = value; }
}
public string StudentAddress
{
get { return address; }
set { address = value; }
}
public int StudentClassId
{
get { return classid; }
set { classid = value; }
}
public int StudentMarks
{
get { return marks; }
set { marks = value; }
}
}
}
view raw gistfile1.txt hosted with ❤ by GitHub
Contains records in the following manner (student.txt)
StudentId|StudentName|StudentAddress|StudentClassId|StudentMarks //header
1|Ishwar|TestAddress|1|85 //row

5. Create a class file called NewStudentRecord (which contains the attributes that need to be transferred to the new file)
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
namespace FirstSampleRhinoETL.DataObjects
{
using FileHelpers;
[DelimitedRecord("\t"), IgnoreFirst]
public class UserFullRecord
{
public int sId;
public string sName;
public string sAddress;
public int sclass;
}
}
This will be outputted in the following manner(tab separated)
StudentId\tStudentName\tStudentAddress\tStudentClass
1\tIshwar\tTestAddress\t1

Let us now create the action called student write i.e. Let us go about writing this out to an output file called studentoutput.txt and I am creating a new C# class file called StudentWriteFile which will be as shown in the following code snippet:-
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using FileHelpers;
using Rhino.Etl.Core.Files;
using Rhino.Etl.Core.Operations;
using Rhino.Etl.Core;
using FirstSampleRhinoETL.DataObjects;
namespace FirstSampleRhinoETL.Operations
{
class StudentWriteFile : AbstractOperation
{
public StudentWriteFile(string filePath)
{
this.filePath = filePath;
}
string filePath = null;
public override IEnumerable<Row> Execute(IEnumerable<Row> rows)
{
FluentFile engine = FluentFile.For<NewStudentRecord>();
engine.HeaderText = "Id\tsName\tsAddress\tsclass";
using (FileEngine file = engine.To(filePath))
{
foreach (Row testRow in rows)
{
Row row = new Row();
//row.Copy(leftRow);
//copy over all properties not in the student records
row["sId"] = testRow["StudentId"];
row["sName"] = testRow["StudentName"];
row["sAddress"] = testRow["StudentAddress"];
row["sclass"] = testRow["StudentClassId"];
file.Write(row.ToObject<NewStudentRecord>());
//pass through rows if needed for another later operation
yield return row;
}
}
}
}
}
view raw StudentWrite hosted with ❤ by GitHub

Now let us go about writing the main program... create the main program in the following manner:-
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using Rhino.Etl.Core;
using FirstSampleRhinoETL.Operations;
using Rhino.Etl.Core.Operations;
using Rhino.Etl.Core.Pipelines;
namespace FirstSampleRhinoETL
{
public class MainProgram: EtlProcess
{
protected override void Initialize(){
Register(new StudentRead(Settings.Default.StudentFile));
Register(new StudentWriteFile(Settings.Default.OutputFile));
}
protected override void PostProcessing(){
base.PostProcessing();
foreach (var error in GetAllErrors())
throw error;
}
}
}
view raw MainProgram hosted with ❤ by GitHub
The setting's values basically point to the settings files that I have created which contains the absolute path of the student.txt and the studentoutput.txt files.

After which in your main just initialize the MainProgram in the following manner:-
 new MainProgram().Execute();

and you will have your first rhino-etl to rock and roll with......



Comments

Ashwin said…
Thanks for the blog.. Ive been looking for something like this to learn Rhino ETL for quite a while.. Was surfing the net before you put this article.

Anyhow I have a small doubt with the example you have provided.

1)"StudentRead" Cant find this class.
Ishwar said…
Great catch Ashwin.... must have added the class at a later stage...
public StudentRead(string filePath) { this.filePath = filePath; }
string filePath = null; public override IEnumerable Execute(IEnumerable rows) { FluentFile engine = FluentFile.For(); engine.HeaderText = "Id\tsName\tsAddress\tsclass"; using (FileEngine file = engine.To(filePath)) { foreach (Row testRow in rows) { Row row = new Row(); //row.Copy(leftRow); //copy over all properties not in the student records row["sId"] = testRow["StudentId"]; row["sName"] = testRow["StudentName"]; row["sAddress"] = testRow["StudentAddress"]; row["sclass"] = testRow["StudentClassId"]; file.Read(row.ToObject()); //pass through rows if needed for another later operation yield return row; } }

Popular posts from this blog

Microsoft acquires LinkedIn

Redshift Experience