Monday, November 7, 2011

Extracting XML data and loading into mysql database

This is my first blog post. I hope this goes well and people understands it easily.

Coming to context we are now going to use a xml to store data into mysql db.

Why XML: Most of the applications out there use xml as the transport medium so someday or other we have to face it.
Tutorial For whom :
  1. This tutorial is for all the people who want an easy way to parse xml, extract data and put the data into database.
  2. It's for the people who migrated from T-SQL and missing the excellent support of xml in mysql.( Atleast I miss it) This will serve as a workaround till the mysql core develops.
Tools needed : MySql database ( I tested it on v 5.5.8), Toad for MySql (optional)
Knowledge required on : extractValue() function of mysql, basic xml knowledge.

NOTE: All the important data will be highlighted in red colour and has courier new font.

Lets start with gaining some basic knowledge of extractValue() function used in MySql.
Well extractValue() as the name suggests it extracts text(cdata) from the xml element.
For more information please have a look at MySql XML Functions Documentation
A basic example of extractValue()
SELECT EXTRACTVALUE('<root><element>data</element></root>','/root/element') AS data;
Make sure that you connect to some database to test and type the above code into your Toad editor window.
You can see the result as below


Now lets decide our xml format. I choose this format to parse.
<params>
    <param>
        <paramname>paramname 1</paramname>
        <paramvalue>paramvalue 1</paramvalue>
    </param>
    <param>
        <paramname>paramname 2</paramname>
        <paramvalue>paramvalue 2</paramvalue>
    </param>
    <param>
        <paramname>paramname 3</paramname>
        <paramvalue>paramvalue 3</paramvalue>
    </param>
</params>
A little bit about xml format :
Here params is the root element under which many param's are present. Each param has a paramname and a paramvalue.

Why this structure :
In general a stored procedure will be having in, out, inout parameters. So if we know what is going in and what is coming out then we can manage inout parameters but think about a case where we don't know how many parameters are needed or else large number of parameters are needed lets say 20. In those cases having 20 parameters in stored procedure is not that easy to maintain for developer and for database caller who sends 20 parameters to it. Instead they both can use a common xml for understanding and parse the xml to get relevant data.In this way it's easy for both ends and it also provides a medium to transport data to one another. This medium can be used when the database is called by more than one application. Lets say a standalone application, web application, mobile application etc., (This is just my thinking . If something is drastically wrong please feel free to contradict ).

To extract the data from xml I wrote a small stored procedure to explain things.
There are 6 steps in doing that

1) Setting up xml as input parameter for stored procedure.

Create a stored procedure something like this
DROP PROCEDURE IF EXISTS newdb1.import_xml;
CREATE PROCEDURE newdb1.`import_xml`(params  TEXT)
BEGIN
END;
Here params is given as text because our xml can grow very long in real time scenarios.

 2) Declaring necessary variables.
DECLARE rowcount INT UNSIGNED DEFAULT 0;
DECLARE paramcount INT UNSIGNED;
DECLARE paramname VARCHAR(255);
DECLARE paramvalue VARCHAR(255);
Here rowcount is to iterate through param's in xml, paramcount is number of param elements in xml, paramname and paramvalue are variables to store extracted data from respective elements.

3) A temporary table to store the extracted data.
CREATE TEMPORARY TABLE temp(paramname VARCHAR(255), paramvalue VARCHAR(2000));
Please note that paramname has given 255 characters and paramvalue is given 2000 because paramname is value of variable name in business context and we don't expect people write stories as variable names and paramvalue is value of variable and here we can expect large data so we gave it 2000. It depends on your business context which length suffices you. Please change as needed.

4) Count the number of params in whole xml.
SET paramcount := EXTRACTVALUE(params,'COUNT(/params/param)');
This will give count of params in xml which in our case is 3.

5) Iterate over each param, extract paramname, paramvalue and inserting them into temp table.
WHILE rowcount < paramcount DO
    SET rowcount := rowcount + 1;
    SET paramname := CONCAT('/params/param[',rowcount,']/paramname[1]');
    SET paramvalue := CONCAT('/params/param[',rowcount,']/paramvalue[1]');
    INSERT INTO temp(paramname, paramvalue) VALUES(EXTRACTVALUE(params, paramname), EXTRACTVALUE(params, paramvalue));
END WHILE;
Here is the interesting part.
A while loop which iterates over all the param elements, extracts data from elements and then it inserts into temp table.

Note : If we use concat('params/param/paramname') then we will get 3 matching elements and if we extract, then all the data in those 3 elements will get concatenated separated by comma which we don't want. So we iterate over elements based on our row count.
Below is the result if we use concat as above


6) Deleting temporary table for the subsequent uses.
DROP TABLE temp;
This deletes the temporary table created and it is a good practice to remove resources when we don't need them anymore. We can have a select statement before deleting to see what data we have got in our table. But in real time scenarios we might not need them.

Note : Please comment all the statements which are for testing, unnecessary select statements and unload all unnecessary resources from memory after completion of task.

Now the final stored procedure with all the bits attached together.
DROP PROCEDURE IF EXISTS newdb1.import_xml;
CREATE PROCEDURE newdb1.`import_xml`(
params                     TEXT
)
/*
'<params>
    <param>
        <paramname>paramname 1</paramname>
        <paramvalue>paramvalue 1</paramvalue>
    </param>
    <param>
        <paramname>paramname 2</paramname>
        <paramvalue>paramvalue 2</paramvalue>
    </param>
    <param>
        <paramname>paramname 3</paramname>
        <paramvalue>paramvalue 3</paramvalue>
    </param>
</params>'
*/

BEGIN
    DECLARE rowcount INT UNSIGNED DEFAULT 0;
    DECLARE paramcount INT UNSIGNED;
    DECLARE paramname VARCHAR(255);
    DECLARE paramvalue VARCHAR(255);

    CREATE TEMPORARY TABLE temp(paramname VARCHAR(255), paramvalue VARCHAR(2000));

    -- calculate the number of row elements.
    SET paramcount := EXTRACTVALUE(params,'COUNT(/params/param)');

    -- loop through all the row elements
    WHILE rowcount < paramcount DO
        SET rowcount := rowcount + 1;
        SET paramname := CONCAT('/params/param[',rowcount,']/paramname[1]');
        SET paramvalue := CONCAT('/params/param[',rowcount,']/paramvalue[1]');
        INSERT INTO temp(paramname, paramvalue) VALUES (EXTRACTVALUE(params, paramname), EXTRACTVALUE(params, paramvalue));
    END WHILE;
    SELECT * FROM temp;
    DROP TABLE temp;
END;
One can call the stored procedure as below. I used concat just to make it visually better for readers but there is no need for that.
CALL import_xml(CONCAT('<params>'
'<param><paramname>paramname 1</paramname><paramvalue>paramvalue 1</paramvalue></param>' 
'<param><paramname>paramname 2</paramname><paramvalue>paramvalue 2</paramvalue></param>'
'<param><paramname>paramname 3</paramname><paramvalue>paramvalue 3</paramvalue></param>'
'</params>') 
);
The final output looks something like this
Since we have the data in temp table we can do anything with it. After this stage we don't use xml to get and set data.