There is no shortage of gems out there that do exactly the aforementioned, but today I'm focusing on two in particular: Ox and Nokogiri::XML::SAX::Parser. I went down this road after diving head first into a backtrace mentioning the two most dreadful words known to man: segmentation fault.
Before I get into the segmentation fault, I wanted to dust off the performance comparison originally done by Peter Ohler, the creator of the Ox gem. If you've perused that gem's README, you'll have seen a performance comparison between Ox, Nokogiri, and LibXML, which can be found in more detail here.
The highlights for SAX parsing (perf_sax.rb) that I want to mention here are:
- Ox was 39.7 times faster than Nokogiri SAX parsing using file IO.
- Ox was 13.3 times faster than LibXML SAX parsing using file IO.
Since that blog post was dated September 21, 2011, I first wanted to re-run the benchmark against the latest versions of all three gems:
- ox (2.0.11)
- nokogiri (1.6.0)
- libxml-ruby (2.7.0)
The results, for direct comparison with Peter's blog post (SAX parsing only):
    A 1000 KByte XML file was parsed 100 times for this test.
    Ox::Sax.sax_parse 100 times in 0.452 seconds or 221.458 sax_parse/sec.
    Nokogiri::XML::Sax.parse 100 times in 12.954 seconds or 7.719 parse/sec.
    LibXML::XML::Sax.parse 100 times in 4.275 seconds or 23.394 parse/sec.
    >>> Ox is 28.69 faster than Nokogiri SAX parsing using file IO.
    >>> Ox is 9.47 faster than LibXML SAX parsing using file IO.
So, though Nokogiri and LibXML gained some ground since 2011, Ox clearly still takes the cake. Now, onto the segmentation fault.
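The comparison above boils down to timing the same parse a fixed number of times per library and dividing the totals. Here is a minimal sketch of that idea using only Ruby's standard `Benchmark` module. The helper names `time_parses` and `speedup` are hypothetical and are not taken from perf_sax.rb, which may well measure things differently. Because it works from the rounded timings printed above, the ratios it prints land within a few hundredths of the reported 28.69 and 9.47 rather than matching them exactly.

```ruby
require 'benchmark'

# Hypothetical helper: run the given block `iterations` times and return
# the total elapsed wall-clock time in seconds.
def time_parses(iterations)
  Benchmark.realtime do
    iterations.times { yield }
  end
end

# Hypothetical helper: how many times faster the "fast" run was than the
# "slow" run, rounded to two decimal places.
def speedup(slow_seconds, fast_seconds)
  (slow_seconds / fast_seconds).round(2)
end

# Ratios recomputed from the rounded timings reported above.
puts "Ox vs Nokogiri: #{speedup(12.954, 0.452)}x"
puts "Ox vs LibXML:   #{speedup(4.275, 0.452)}x"
```

To time a real parser, the block passed to `time_parses` would wrap a single parse call for the library under test, for example `time_parses(100) { ... }` around one SAX parse of the test file.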
The reason this error happened was some invalid XML, the culprit of which can be seen in more detail here. The XML comes from a third party, so aside from working with them to fix it, we looked at how each parser coped. Not only was Nokogiri more forgiving than Ox, which segfaulted, but its error callback also gives us helpful error messages that could easily trigger a notification via Airbrake or Honeybadger:
["Start tag expected, '<' not found\n", "expected '>'\n", "StartTag: invalid element name\n", "Opening and ending tag mismatch: SampleNode4 line 4 and SampleNode2\n", "StartTag: invalid element name\n", "error parsing attribute name\n", "attributes construct error\n", "Couldn't find end of Start Tag subnode1Unknown line 7\n", "Opening and ending tag mismatch: Metadata line 3 and SampleNode3a\n", "expected '>'\n", "Opening and ending tag mismatch: SampleXML line 2 and bSampleNode4\n", "Extra content at the end of the document\n"]
Ox is clearly much faster, and there's no arguing with the numbers above. But if your traffic allows it and you don't have much control over the XML coming through, it might be worth exploring Nokogiri or another library.
In our initial tests, parsing a 6 MB XML file with our callbacks took:
- 2.015007 seconds with Nokogiri
- 0.262095 seconds with Ox
This all comes down to personal preference and your other dependencies, but if you're OK with an extra second or two of processing, the better error messaging might save you some headaches. Cheers.