If you're interested in functional programming, you might also want to checkout my second blog which i'm actively working on!!

Friday, February 15, 2013

Merging CSV and XML data with fast performance

This blogpost will show a simple demo of how you can merge XML data with additional data coming from a Comma Separated File. Again I just made up some testdata for demo purpose. But just to give some numbers, my real use case merged data from a 242MB XML file and a CSV file in less than 5 seconds using of course Saxon.
nxp10009@NXL01366 /c/xsltdemo
$ ls -la
total 4786
drwxr-xr-x    6 nxp10009 Administ        0 Feb 15 16:35 .
drwxr-xr-x    1 nxp10009 Administ    12288 Feb 15 16:19 ..
-rw-r--r--    1 nxp10009 Administ  9788993 Oct 30 13:42 Saxon-HE-9.4.jar
drwxr-xr-x    4 nxp10009 Administ        0 Feb 15 16:34 input
drwxr-xr-x    4 nxp10009 Administ        0 Feb 15 17:18 output
drwxr-xr-x    4 nxp10009 Administ        0 Feb 15 17:03 xslt


nxp10009@NXL01366 /c/xsltdemo/input
$ ls -la
total 1
drwxr-xr-x    4 nxp10009 Administ        0 Feb 15 16:34 .
drwxr-xr-x    6 nxp10009 Administ        0 Feb 15 16:35 ..
-rw-r--r--    1 nxp10009 Administ       47 Feb 15 16:46 studentinfo.csv
-rw-r--r--    1 nxp10009 Administ      576 Feb 15 17:13 students.xml

studentinfo.csv
1, m, developer
2, m, developer
3, f, model

students.xml
<?xml version="1.0" encoding="UTF-8" ?>
<students>
  <student>
    <firstname>Robby</firstname>
    <lastname>Pelssers</lastname>
    <dateofbirth>1977-02-07</dateofbirth>
    <studentid>1</studentid>
  </student>
  <student>
    <firstname>Ivan</firstname>
    <lastname>Lagunov</lastname>
    <dateofbirth>1987-04-30</dateofbirth>
    <studentid>2</studentid>
  </student>
  <student>
    <firstname>Pamela</firstname>
    <lastname>Anderson</lastname>
    <dateofbirth>1967-07-01</dateofbirth>
    <studentid>3</studentid>
  </student>  
</students>

So the first thing I did was creating an XML representation of that CSV file using below xslt.
studentinfo.xslt
<?xml version="1.0" encoding="UTF-8"?>
<!--
Author: Robby Pelssers

Transforms studentinfo.csv into XML representation

Usage from DOS-Shell:
java -jar Saxon-HE-9.4.jar -o:C:/xsltdemo/output/studentinfo.xml -it:main -xsl:C:/xsltdemo/xslt/studentinfo.xslt studentinfoCSV=file:/C:/xsltdemo/input/studentinfo.csv

-->

<xsl:stylesheet version="2.0" 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
  <xsl:param name="studentinfoCSV" />
  

  <xsl:variable name="linefeed" select="'\r?\n'"/>
  <xsl:variable name="csv" select="unparsed-text($studentinfoCSV)"/>  

  
  <xsl:template match="/" name="main">
    <info>
      <xsl:for-each select="tokenize($csv, $linefeed)">
        <xsl:variable name="fields" select="tokenize(., ',')"/>
        <studentinfo id="{normalize-space($fields[1])}">
          <gender><xsl:sequence select="normalize-space($fields[2])"/></gender>
          <profession><xsl:sequence select="normalize-space($fields[3])"/></profession>
        </studentinfo>  
      </xsl:for-each>
    </info>
  </xsl:template>

</xsl:stylesheet>

So after the transformation a new file studentinfo.xml gets generated as below.
<?xml version="1.0" encoding="UTF-8"?>
<info>
   <studentinfo id="1">
      <gender>m</gender>
      <profession>developer</profession>
   </studentinfo>
   <studentinfo id="2">
      <gender>m</gender>
      <profession>developer</profession>
   </studentinfo>
   <studentinfo id="3">
      <gender>f</gender>
      <profession>model</profession>
   </studentinfo>
</info>

So now we need to execute a second transform using as input students.xml and studentinfo.xml.
student_addinfo.xslt
<?xml version="1.0" encoding="UTF-8"?>
<!--
Author: Robby Pelssers

java -Xmx1024m -jar Saxon-HE-9.4.jar -s:C:/xsltdemo/input/students.xml -o:C:/xsltdemo/output/students-full.xml -xsl:C:/xsltdemo/xslt/student_addinfo.xslt studentinfoXML=file:/C:/xsltdemo/output/studentinfo.xml
-->

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
  <xsl:param name="studentinfoXML" />

  <xsl:variable name="studentinfoDocument" select="document($studentinfoXML)"/>

  <xsl:key name="student-lookup" match="studentinfo" use="@id"/>

  <xsl:template match="/">
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="student">
    <student>
      <!-- we copy all present attributes and children -->
      <xsl:apply-templates select="@* | node()"/>
      <!-- now we also want to add the additional information as children --> 
      <xsl:apply-templates select="key('student-lookup', studentid, $studentinfoDocument)/*"/>
    </student>
  </xsl:template>
 
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template> 


</xsl:stylesheet>


And finally we get the merged result:
<?xml version="1.0" encoding="UTF-8"?>
<students>
  <student>
      <firstname>Robby</firstname>
      <lastname>Pelssers</lastname>
      <dateofbirth>1977-02-07</dateofbirth>
      <studentid>1</studentid>
      <gender>m</gender>
      <profession>developer</profession>
   </student>
  <student>
      <firstname>Ivan</firstname>
      <lastname>Lagunov</lastname>
      <dateofbirth>1987-04-30</dateofbirth>
      <studentid>2</studentid>
      <gender>m</gender>
      <profession>developer</profession>
   </student>
  <student>
      <firstname>Pamela</firstname>
      <lastname>Anderson</lastname>
      <dateofbirth>1967-07-01</dateofbirth>
      <studentid>3</studentid>
      <gender>f</gender>
      <profession>model</profession>
   </student>  
</students>

No comments:

Post a Comment