Unmarshalling XML without a namespace using JAXB

Ok, this is just a short post on unmarshalling XML which doesn’t have a namespace into JAXB objects which do.

Right, first you’ll need two XSDs which define the objects that you’ll be receiving as XML strings:

vsf-payload.xsd

<?xml version="1.0" encoding="UTF-8"?>
<schema targetNamespace="http://www.v-s-f.co.uk/payload" 
	xmlns="http://www.w3.org/2001/XMLSchema"
	xmlns:tns="http://www.v-s-f.co.uk/payload"
	xmlns:vsfd="http://www.v-s-f.co.uk/domain">
	
	<import namespace="http://www.v-s-f.co.uk/domain" schemaLocation="vsf-domain.xsd" />
	
	<element name="VsfPayload" type="tns:VsfPayload" />

	<complexType name="VsfPayload">
		<sequence>
			<element name="VsfDomain" type="vsfd:VsfDomain" />
		</sequence>
	</complexType>
</schema>

vsf-domain.xsd

<?xml version="1.0" encoding="UTF-8"?>
<schema targetNamespace="http://www.v-s-f.co.uk/domain" 
	xmlns="http://www.w3.org/2001/XMLSchema"
	xmlns:tns="http://www.v-s-f.co.uk/domain">
	
	<element name="VsfDomain" type="tns:VsfDomain" />

	<complexType name="VsfDomain">
		<sequence>
			<element name="Hello" type="string" />
		</sequence>
	</complexType>
</schema>

To generate Java classes from the above XSDs we can use the CXF XJC plugin for Maven:

<plugin>
				<groupId>org.apache.cxf</groupId>
				<artifactId>cxf-xjc-plugin</artifactId>
				<version>2.3.0</version>
				<configuration>
					<extensions>
						<extension>org.apache.cxf.xjcplugins:cxf-xjc-dv:2.3.0</extension>
					</extensions>
				</configuration>
				<executions>
					<execution>
						<id>generate-sources</id>
						<phase>generate-sources</phase>
						<goals>
							<goal>xsdtojava</goal>
						</goals>
						<configuration>
							<sourceRoot>${basedir}/target/generated/src/main/java</sourceRoot>
							<xsdOptions>
								<xsdOption>
									<xsd>${basedir}/src/main/resources/vsf-payload.xsd</xsd>
								</xsdOption>
							</xsdOptions>
						</configuration>
					</execution>
				</executions>
			</plugin>

Run a clean install to generate the objects.

Open the generated class VsfPayload and add an @XmlRootElement annotation above the class declaration:

@XmlAccessorType(XmlAccessType.FIELD)
@XmlType(name = "VsfPayload", propOrder = {
    "vsfDomain"
})
@XmlRootElement
public class VsfPayload {

Create a test class for testing the unmarshalling as follows:

package uk.co.vsf.utilities;

import java.io.StringReader;
import static org.testng.Assert.*;
import java.io.StringWriter;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.Marshaller;
import javax.xml.bind.Unmarshaller;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.testng.annotations.Test;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

import uk.co.v_s_f.domain.VsfDomain;
import uk.co.v_s_f.payload.VsfPayload;

public class XMLTrial {

	@Test
	public void testMarshall() throws JAXBException
	{
		// print the marshalled XML so the namespaced form can be inspected
		System.out.println(marshallHelloWorld());
	}
	
	private String marshallHelloWorld() throws JAXBException
	{
		VsfPayload payload = new VsfPayload();
		VsfDomain domain = new VsfDomain();
		domain.setHello("Hello World");
		payload.setVsfDomain(domain);
		
		return writeXMLOut(payload);
	}
	
	
	
	private String writeXMLOut(VsfPayload payload) throws JAXBException
	{
		JAXBContext jc = JAXBContext.newInstance(VsfPayload.class.getPackage().getName());
		//Create marshaller
		Marshaller marshaller = jc.createMarshaller();
		//Marshal the payload object into a string
		StringWriter sw = new StringWriter();
		
		marshaller.marshal(payload, sw);
		
		return sw.getBuffer().toString();
	}
}

The testMarshall method writes out the XML string produced by marshalling the object, which shows what you’d expect to receive in a well-formed XML message complete with namespaces.

Now we add a new method to the test class which will unmarshall a string that has no namespace information (so the complete file is now…):

package uk.co.vsf.utilities;

import java.io.StringReader;
import static org.testng.Assert.*;
import java.io.StringWriter;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.JAXBException;
import javax.xml.bind.Marshaller;
import javax.xml.bind.Unmarshaller;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.testng.annotations.Test;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

import com.testexample.payload.ContentWrapper;

import uk.co.v_s_f.domain.VsfDomain;
import uk.co.v_s_f.payload.VsfPayload;

public class XMLTrial {

	@Test
	public void testMarshall() throws JAXBException
	{
		// print the marshalled XML so the namespaced form can be inspected
		System.out.println(marshallHelloWorld());
	}
	
	private String marshallHelloWorld() throws JAXBException
	{
		VsfPayload payload = new VsfPayload();
		VsfDomain domain = new VsfDomain();
		domain.setHello("Hello World");
		payload.setVsfDomain(domain);
		
		return writeXMLOut(payload);
	}
	
	@Test
	public void domToObjects() throws Exception
	{
		String xml = "<ContentWrapper><VsfPayload><VsfDomain><Hello>Hello World</Hello></VsfDomain></VsfPayload></ContentWrapper>";
		DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
	    InputSource is = new InputSource();
	    is.setCharacterStream(new StringReader(xml));

	    Document doc = db.parse(is);
		
		
		
		JAXBContext jc = JAXBContext.newInstance(ContentWrapper.class.getPackage().getName());
		//Create unmarshaller
		Unmarshaller unmarshaller = jc.createUnmarshaller();
		//Unmarshal the first node of the parsed document into a ContentWrapper
		
		ContentWrapper wrapper = (ContentWrapper) unmarshaller.unmarshal(doc.getFirstChild());
		
		assertEquals(writeXMLOut(wrapper.getVsfPayload()), marshallHelloWorld());
	}
	
	private String writeXMLOut(VsfPayload payload) throws JAXBException
	{
		JAXBContext jc = JAXBContext.newInstance(VsfPayload.class.getPackage().getName());
		//Create marshaller
		Marshaller marshaller = jc.createMarshaller();
		//Marshal the payload object into a string
		StringWriter sw = new StringWriter();
		
		marshaller.marshal(payload, sw);
		
		return sw.getBuffer().toString();
	}
}

That new method references a class, ContentWrapper, which doesn’t exist yet, so the test won’t compile. ContentWrapper is another XJC-generated class, so add another XSD file:

wrapper.xsd

<?xml version="1.0" encoding="UTF-8"?>
<schema targetNamespace="http://www.testexample.com/payload" 
	xmlns="http://www.w3.org/2001/XMLSchema"
	xmlns:tns="http://www.testexample.com/payload"
	xmlns:vsfp="http://www.v-s-f.co.uk/payload">
	
	<import namespace="http://www.v-s-f.co.uk/payload" schemaLocation="vsf-payload.xsd" />
	
	<element name="ContentWrapper" type="tns:ContentWrapper" />

	<complexType name="ContentWrapper">
		<sequence>
			<element name="VsfPayload" type="vsfp:VsfPayload" />
		</sequence>
	</complexType>
</schema>

That new XSD contains a single element ContentWrapper which contains the VsfPayload object. Change your pom to reference the wrapper file, not the payload file:

...
    <xsd>${basedir}/src/main/resources/wrapper.xsd</xsd>
...

And regenerate the objects by running a clean install. Re-edit the VsfPayload class to add the root element annotation. Also edit the ContentWrapper class to add the following XML root element annotation, with an empty namespace to match the un-namespaced incoming XML:

@XmlAccessorType(XmlAccessType.FIELD)
@XmlType(name = "ContentWrapper", propOrder = {
    "vsfPayload"
})
@XmlRootElement(name="ContentWrapper", namespace="")
public class ContentWrapper {

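Now your test class should compile. The second test method, domToObjects, parses the XML string into a Document, takes its first node (the ContentWrapper element) and unmarshalls that node. It then marshalls the resulting VsfPayload to confirm that the string representation is exactly the same as the one produced by marshallHelloWorld, the helper behind testMarshall. Blanking the namespace on the root element alone is enough here because none of the XSDs set elementFormDefault, so the nested elements are unqualified by default.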
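One thing to watch with "get the first node": it only works when the root element really is the first child of the Document. The sketch below uses nothing but the JDK's DOM API to show the difference between getFirstChild() and getDocumentElement() when the incoming XML starts with a comment. The class name RootNodeSketch and the sample strings are made up for illustration.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class RootNodeSketch {

    /** Parses an XML string into a DOM Document, wrapping checked exceptions. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // hypothetical input: same payload as the test, but with a leading comment
        Document doc = parse("<!-- note --><ContentWrapper><VsfPayload/></ContentWrapper>");

        // the first child of the Document is the comment, not the root element
        System.out.println("first child type: " + doc.getFirstChild().getNodeType());

        // getDocumentElement() always returns the root element
        System.out.println("root element: " + doc.getDocumentElement().getNodeName());
    }
}
```

If the messages you receive might carry a leading comment or processing instruction, passing getDocumentElement() to the unmarshaller is the safer choice.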

Anyway, that’s one solution for unmarshalling XML that doesn’t have a namespace. There’s probably also a way to add a binding override to force ContentWrapper to be generated with a blank namespace and a root element annotation.
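Another option, which I haven't tried against the generated classes, is to leave them alone and put the namespace onto the incoming DOM instead. Since only the root element is namespace-qualified, renaming that one node should be enough. The sketch below uses the standard DOM Level 3 method Document.renameNode; the class name RootNamespaceSketch is made up, and the namespace is the one from wrapper.xsd.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

final class RootNamespaceSketch {

    // target namespace declared in wrapper.xsd
    static final String WRAPPER_NS = "http://www.testexample.com/payload";

    /** Parses un-namespaced XML and moves only the root element into WRAPPER_NS. */
    static Document parseAndQualifyRoot(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // namespace awareness is needed so that elements carry a namespace URI at all
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            Element root = doc.getDocumentElement();
            // same element name, but now in the wrapper namespace; children are untouched
            doc.renameNode(root, WRAPPER_NS, root.getNodeName());
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse or rename XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseAndQualifyRoot(
                "<ContentWrapper><VsfPayload><VsfDomain><Hello>Hello World</Hello></VsfDomain></VsfPayload></ContentWrapper>");
        System.out.println("root namespace: " + doc.getDocumentElement().getNamespaceURI());
        System.out.println("child namespace: " + doc.getDocumentElement().getFirstChild().getNamespaceURI());
    }
}
```

My assumption is that the resulting Document could then be handed to the unmarshaller, and that without an @XmlRootElement annotation the result would come back wrapped in a JAXBElement rather than as a bare ContentWrapper. Treat that part as unverified.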

GU10 follow up – price checking

This is really a follow up to http://blog.v-s-f.co.uk/2011/12/the-real-cost-of-gu10s/.  I’ve been monitoring the prices of the new 4W LED GU10s on CPC for a few weeks now.  When CPC adds a product to their website, they add the base product code, e.g. LP04169, and then there are up to 99 further product code permutations possible, e.g. LP04169[01-99].  For the past three weeks I’ve been checking all the codes manually (typing each product code into the search) and visually identifying the cheapest one.
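As a small sketch of that numbering scheme, the following builds the full list of candidate codes for a base code: the bare code first, then the suffixes 01 to 99. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class ProductCodes {

    /** Returns the base code followed by baseCode + "01" through baseCode + "99". */
    static List<String> candidates(String baseCode) {
        List<String> codes = new ArrayList<String>();
        codes.add(baseCode);
        for (int suffix = 1; suffix <= 99; suffix++) {
            // %02d pads single digits with a leading zero, e.g. 7 becomes "07"
            codes.add(baseCode + String.format("%02d", suffix));
        }
        return codes;
    }

    public static void main(String[] args) {
        List<String> codes = candidates("LP04169");
        System.out.println(codes.size() + " codes, from " + codes.get(0)
                + " to " + codes.get(codes.size() - 1));
    }
}
```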

So today (yes… I know it’s Christmas day, but I couldn’t sleep…) I decided to automate the process.  I started off by writing a program using Mule 3.2, but Mule doesn’t follow the redirects and load the JavaScript correctly.  So my next choice was Selenium, but I don’t like the fiddly UI for Firefox.  I created the bare basics for a Selenium test and then found I could export the test to Java!  I last used Java Selenium 2 years ago and it wasn’t all that easy, but they’ve vastly improved it since then and added Maven support.  I then proceeded to re-write the code it automatically generated and added in what I really wanted to achieve – price checking.

Below is the code required to check a product on CPC for the lowest price available:

(I’ve used TestNG rather than JUnit as it was the test framework already imported for the previous utilities class, but it could be swapped for JUnit.)

/**
 * <p>Copyright 2011 Victoria Scales</p>
 * <p>This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.</p>
 * <p>This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for more details.</p>
 * <p>You should have received a copy of the GNU General Public License along with this program; if not, see <http://www.gnu.org/licenses/>.</p>
 */
package uk.co.vsf.utilities;

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.TimeUnit;

import org.openqa.selenium.By;
import org.openqa.selenium.NoSuchElementException;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;
import org.testng.annotations.AfterMethod;
import org.testng.annotations.BeforeMethod;
import org.testng.annotations.Test;

/**
 * Simple class for checking the price of a CPC product to find the cheapest code.
 * 
 * @author Victoria
 * @date 2011-12-25
 * @version 0.1
 */
public class CPCPriceCheckerSelenium
{
    // set the driver to the firefox driver by default
    private WebDriver driver = new FirefoxDriver();
    // set the url to the website you wish to check
    private String baseUrl = "http://cpc.farnell.com/";

    @BeforeMethod
    public void setUp() throws Exception
    {
        // this configures a timeout if a selection is not successful after X seconds.
        driver.manage().timeouts().implicitlyWait(10, TimeUnit.SECONDS);
    }

    @Test
    public void productPriceCheck() throws Exception
    {
        Map<String, Double> pricesFound = new HashMap<String, Double>();
        final int pagesPossible = 100;
        for (int i = 0 ; i < pagesPossible ; i++)
        {
            try
            {
                String product = "LP04169" + pageExtension(i);
                driver.get(baseUrl + product);

                // select the price off the page (if possible)
                // if the text is not found, NoSuchElementException is thrown.
                String price = driver.findElement(By.cssSelector("span.taxedvalue")).getText();
                price = price.replace("(£", "");
                price = price.replace(")", "");

                double priceAsMonetary = Double.parseDouble(price);

                pricesFound.put(product, priceAsMonetary);
            }
            catch (NoSuchElementException e)
            {
                // no price was found for this product code, so skip it
            }

            // wait for 10 seconds so as not to spam the website and get blocked...
            Thread.sleep(10000);
        }

        List<String> productCodes = new ArrayList<String>();
        productCodes.addAll(pricesFound.keySet());
        Collections.sort(productCodes);

        for (String instance : productCodes)
        {
            System.out.println(instance + " " + pricesFound.get(instance));
        }

        // cheapest price

        List<Double> pricesOrdered = new ArrayList<Double>();
        pricesOrdered.addAll(pricesFound.values());
        Collections.sort(pricesOrdered);

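        System.out.println();
        if (pricesOrdered.isEmpty())
        {
            // avoid an IndexOutOfBoundsException when no page had a price
            System.out.println("No prices found");
            return;
        }
        System.out.println("Cheapest price: £" + pricesOrdered.get(0));
        System.out.println("Most expensive price: £" + pricesOrdered.get(pricesOrdered.size() - 1));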
    }

    private String pageExtension(int i)
    {
        if (i == 0)
        {
            return "";
        }
        else if (i > 9)
        {
            return "" + i;
        }

        return "0" + i;
    }

    @AfterMethod
    public void tearDown() throws Exception
    {
        driver.quit();
    }
}
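The test above prints the cheapest price but leaves you to spot which code it belongs to. Below is a possible refinement, offered as a sketch and not as part of the original test: it pulls the amount out of text such as "(£1.23)" with a regular expression, and returns the product code with the lowest price. The class name, method names and sample prices are all made up, and BigDecimal is my substitution for the double used above.

```java
import java.math.BigDecimal;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PriceSketch {

    // first run of digits with an optional decimal part
    private static final Pattern AMOUNT = Pattern.compile("(\\d+(?:\\.\\d+)?)");

    /** Extracts the first decimal amount from text such as "(£1.23)". */
    static BigDecimal parsePrice(String text) {
        // drop thousands separators before matching
        Matcher matcher = AMOUNT.matcher(text.replace(",", ""));
        if (!matcher.find()) {
            throw new IllegalArgumentException("No price found in: " + text);
        }
        return new BigDecimal(matcher.group(1));
    }

    /** Returns the product code with the lowest price, or null if the map is empty. */
    static String cheapestCode(Map<String, BigDecimal> prices) {
        String best = null;
        for (Map.Entry<String, BigDecimal> entry : prices.entrySet()) {
            if (best == null || entry.getValue().compareTo(prices.get(best)) < 0) {
                best = entry.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // made-up prices, purely to exercise the two methods (\u00a3 is the pound sign)
        Map<String, BigDecimal> prices = new LinkedHashMap<String, BigDecimal>();
        prices.put("LP04169", parsePrice("(\u00a33.49)"));
        prices.put("LP0416901", parsePrice("(\u00a32.99)"));
        String cheapest = cheapestCode(prices);
        System.out.println(cheapest + " " + prices.get(cheapest));
    }
}
```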

To use Selenium with Maven add the following to your pom:

<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>2.15.0</version>
</dependency>

Above average site scrape

If, like me, you check your site logs most days, you’ve probably noticed one or two site scrapes. Fortunately most are from the same user (or bot) and use the same IP, host and browser summary. They grab 20 or so pages in a minute and then stop and come back another day to finish the site off. On a bad day, they might rip the entire site in one pass… With my plugin, targeting those individuals is easier than it used to be; add the browser, host or IP to the block rules in vsf-simple-block and the annoying user is gone, e.g. host = ‘amazonaws.com’

But the other week, I had another visit by an above average site scraping “team.” I’ve only had about 3 of these types of site scrapes before, so I don’t automatically twig and tend to give the benefit of the doubt.

I happened to be looking at my log when I saw the first 4 hits shown in the image below. With so few hits, I assumed it was just coincidence that they all had the same browser summary and kept an eye on it. About ten minutes later though, the hits were still coming in: a different URL, a random IP and host, but the same browser summary. Blocking on browser summary isn’t my favourite choice, as it can potentially exclude a lot of users who are completely innocent, but I added the rule to the block plugin and prevented another hour’s worth of hits. When blocked, the number of hits increased three-fold.

I’m still in the thinking stage on how to code a solution to prevent this type of site scrape. As you can see from the image, between the first hit and the last before being blocked, there is a gap of nineteen minutes. My site doesn’t have a high volume of hits, so all the hits are clumped together. Patterns in clumped hits are relatively easy to detect, but statistically there would be a hit by a genuine user in the middle, which would make the pattern harder to pick out.
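As a starting point for that thinking, here is one possible shape for the detection logic, written in Java to match the other code on this blog and not as plugin code. It keeps a sliding time window of hits per browser summary and flags a summary once it has been seen from several distinct IPs inside the window. Grouping by browser summary means a genuine user with a different browser in the middle of the clump doesn't break the pattern. Everything here is an assumption: the class name, the twenty minute window and the threshold of four IPs are placeholders, and a very common browser summary would be flagged just as readily as a scraper.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class ScrapeDetectorSketch {

    /** One logged hit: when it happened and which IP it came from. */
    private static final class Hit {
        final long timeMillis;
        final String ip;

        Hit(long timeMillis, String ip) {
            this.timeMillis = timeMillis;
            this.ip = ip;
        }
    }

    private final long windowMillis;
    private final int distinctIpThreshold;
    private final Map<String, Deque<Hit>> hitsByBrowser = new HashMap<String, Deque<Hit>>();

    ScrapeDetectorSketch(long windowMillis, int distinctIpThreshold) {
        this.windowMillis = windowMillis;
        this.distinctIpThreshold = distinctIpThreshold;
    }

    /**
     * Records a hit and returns true when this browser summary has been seen from
     * at least distinctIpThreshold different IPs within the window.
     * Assumes hits are recorded in time order.
     */
    boolean recordHit(long timeMillis, String browserSummary, String ip) {
        Deque<Hit> hits = hitsByBrowser.get(browserSummary);
        if (hits == null) {
            hits = new ArrayDeque<Hit>();
            hitsByBrowser.put(browserSummary, hits);
        }
        hits.addLast(new Hit(timeMillis, ip));

        // drop hits that have fallen out of the window
        while (timeMillis - hits.peekFirst().timeMillis > windowMillis) {
            hits.removeFirst();
        }

        Set<String> distinctIps = new HashSet<String>();
        for (Hit hit : hits) {
            distinctIps.add(hit.ip);
        }
        return distinctIps.size() >= distinctIpThreshold;
    }

    public static void main(String[] args) {
        long minute = 60000L;
        // placeholder thresholds: 4 distinct IPs within 20 minutes
        ScrapeDetectorSketch detector = new ScrapeDetectorSketch(20 * minute, 4);
        String[] ips = {"10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"};
        for (int i = 0; i < ips.length; i++) {
            boolean suspicious = detector.recordHit(i * 5 * minute, "ExampleBrowser/1.0", ips[i]);
            System.out.println(ips[i] + " suspicious=" + suspicious);
        }
    }
}
```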