Scraping the Web

Intro to Nokogiri, Part 1

Christopher Hague
Aug 27, 2017 · 8 min read

I have a friend who works in sales and says he spends a significant portion of his day copying potential clients’ phone numbers from a popular website and pasting them into an Excel spreadsheet. Not only is this an extremely inefficient use of his time, but it’s also mundane. Can you imagine having to spend several hours a week visiting a website only to copy a phone number into a spreadsheet? Surely there must be a better way…

What is scraping?

Scraping is the act of parsing an HTML document and gathering (or scraping, see what I did there) certain information from it. It’s pretty straightforward: look at a document, identify the information you want to grab and then grab it. This post is going to be a brief introduction to using Nokogiri and Open-URI to make a request to a webpage and scrape information from it.

What is Open-URI?

Open-URI is a module in Ruby's standard library that enables applications to make HTTP requests. In order to make the post as concise as possible we will only be working with the open method (which we will get back to shortly). Note that on Ruby 3.0 and later the bare open no longer accepts URLs, so the method is called as URI.open. Feel free to reference the Open-URI docs here if you want additional information.
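Here is a minimal sketch of what a request through Open-URI might look like, assuming a recent Ruby where the method is called as URI.open. The helper name fetch_html is hypothetical (it is not part of Open-URI), and the two rescued error classes are just one reasonable choice, not an exhaustive list:

```ruby
require 'open-uri'

# Hypothetical helper: fetch a page and return its body as a String,
# or nil if the request fails.
def fetch_html(url)
  # The block form closes the connection once the body has been read.
  URI.open(url) { |response| response.read }
rescue OpenURI::HTTPError, SocketError => e
  # OpenURI::HTTPError is raised for error status codes such as 404;
  # SocketError is raised when the host name cannot be resolved.
  warn "Request to #{url} failed: #{e.message}"
  nil
end

# Example usage (needs network access):
#   html = fetch_html("https://en.wikipedia.org/wiki/Napoleon")
#   puts html[0, 80] if html
```

Returning nil on failure keeps the sketch short; a real script might prefer to retry or re-raise.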

What is Nokogiri?

Remember how Open-URI enables us to make HTTP requests? Nokogiri takes the HTML from that response and parses it into a document made up of nested nodes. Once we have this document, we can use Nokogiri to extract the desired information. As we did with Open-URI, we will focus on just two Nokogiri methods to help us accomplish this: .css and .text.

Testing

How about an example to get an idea of what our response might look like? Let’s define a method that will do two things: 1) make an HTTP request to a webpage, and 2) use Nokogiri to convert the response into a document of nested nodes. Let’s see what the Wikipedia article on Napoleon has in store for us:

Napoleon
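require 'nokogiri'
require 'open-uri'

def sample
  url = "https://en.wikipedia.org/wiki/Napoleon"
  # URI.open makes the HTTP request; Nokogiri::HTML parses the response
  doc = Nokogiri::HTML(URI.open(url))
  p doc
end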

Invoke the method at the bottom of the file (add a line that just says sample), then run the file in your terminal. The printed output looks something like this:

#<Nokogiri::HTML::Document:0x3fd069489508 name="document" children=[#<Nokogiri::XML::DTD:0x3fd069488680 name="html">, #<Nokogiri::XML::Element:0x3fd069481c54 name="html" attributes=[#<Nokogiri::XML::Attr:0x3fd069481a60 name="class" value="client-nojs">, #<Nokogiri::XML::Attr:0x3fd069481a38 name="lang" value="en">, #<Nokogiri::XML::Attr:0x3fd069481a24 name="dir" value="ltr">] children=[#<Nokogiri::XML::Text:0x3fd069029094 "\n">, #<Nokogiri::XML::Element:0x3fd069028a04 name="head" children=[#<Nokogiri::XML::Text:0x3fd06951d6cc "\n">, #<Nokogiri::XML::Element:0x3fd06951d370 name="meta" attributes=[#<Nokogiri::XML::Attr:0x3fd06951cf74 name="charset" value="UTF-8">]>, #<Nokogiri::XML::Text:0x3fd069019be4 "\n">, #<Nokogiri::XML::Element:0x3fd0690187bc name="title" children=[#<Nokogiri::XML::Text:0x3fd069011da4 "Napoleon - Wikipedia">]>, #<Nokogiri::XML::Text:0x3fd069010378 "\n">, #<Nokogiri::XML::Element:0x3fd069010fbc name="script" children=[#<Nokogiri::XML::CDATA:0x3fd069509dac "document.documentElement.className = document.documentElement.className.replace( /(^|\\s)client-nojs(\\s|$)/, \"$1client-js$2\" );">]>, #<Nokogiri::XML::Text:0x3fd069508178 "\n">, #<Nokogiri::XML::Element:0x3fd069501d8c name="script" children=[#<Nokogiri::XML::CDATA:0x3fd0694e815c "(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({\"wgCanonicalNamespace\":\"\",\"wgCanonicalSpecialPageName\":false,\"wgNamespaceNumber\":0,\"wgPageName\":\"Napoleon\",\"wgTitle\":\"Napoleon\",\"wgCurRevisionId\":795569844,\"wgRevisionId\":795569844,\"wgArticleId\":69880,\"wgIsArticle\":true,\"wgIsRedirect\":false,\"wgAction\":\"view\",\"wgUserName\":null,\"wgUserGroups\":[\"*\"],\"wgCategories\":[\"CS1 maint: Extra text: authors list\",\"CS1 French-language sources (fr)\",\"Wikipedia indefinitely semi-protected pages\",\"Wikipedia indefinitely move-protected pages\",\"Good articles\",\"Use dmy dates from June 2016\",\"Use British English from August 2015\",\"Pages using infobox royalty with unknown 
parameters\",\"All articles with vague or ambiguous time\",\"Vague or ambiguous time from August 2016\",\"All articles with unsourced statements\",\"Articles with unsourced statements from June 2016\",\"Articles with unsourced statements from August 2016\",\"All articles with specifically marked weasel-worded phrases\",\"Articles with specifically marked weasel-worded phrases from August 2016\",\"Articles needing additional references from March 2017\",\"All articles needing additional references\",\"Pages using web citations with no URL\",\"Articles with Project Gutenberg links\",\"Articles with Internet Archive links\",\"Interlanguage link template link number\",\"Articles containing French-language text\",\"AC with 18 elements\",\"Wikipedia articles with VIAF identifiers\",\"Wikipedia articles with LCCN identifiers\",\"Wikipedia articles with ISNI identifiers\",\"Wikipedia articles with GND identifiers\",\"Wikipedia articles with SELIBR identifiers\",\"Wikipedia articles with BNF identifiers\",\"Wikipedia articles with BPN identifiers\",\"Wikipedia articles with BIBSYS identifiers\",\"Wikipedia articles with ULAN identifiers\",\"Wikipedia articles with NLA identifiers\",\"Wikipedia articles with SBN identifiers\",\"Wikipedia articles with SNAC-ID identifiers\",\"Napoleon\",\"1769 births\",\"1821 deaths\",\"18th-century rulers in Europe\",\"19th-century monarchs in Europe\",\"Amateur mathematicians\",\"Art collectors\",\"Assassination attempt survivors\",\"Characters in War and Peace\",\"Corsican politicians\",\"Deaths from stomach cancer\",\"French commanders of the Napoleonic Wars\",\"French emperors\",\"French exiles\",\"French military leaders of the French Revolutionary Wars\",\"French people of Italian descent\",\"French Roman Catholics\",\"House of Bonaparte\",\"Kings of Italy\",\"Leaders who took power by coup\",\"Members of the French Academy of Sciences\",\"Monarchs imprisoned and detained during war\",\"Monarchs who abdicated\",\"People excommunicated 
by the Roman Catholic Church\",\"People from Ajaccio\",\"People of the First French Empire\",\"People of Tuscan descent\",\"Princes of Andorra\"],\"wgBreakFrames\":false,\"wgPageContentLanguage\":\"en\",\"wgPageContentModel\":\"wikitext\",\"wgSeparatorTransformTable\":[\"\",\"\"],\"wgDigitTransformTable\":[\"\",\"\"],\"wgDefaultDateFormat\":\"dmy\",\"wgMonthNames\":[\"\",\"January\",\"February\",\"March\",\"April\",\"May\",\"June\",\"July\",\"August\",\"September\",\"October\",\"November\",\"December\"],\"wgMonthNamesShort\":[\"\",\"Jan\",\"Feb\",\"Mar\",\"Apr\",\"May\",\"Jun\",\"Jul\",\"Aug\",\"Sep\",\"Oct\",\"Nov\",\"Dec\"],\"wgRelevantPageName\":\"Napoleon\",\"wgRelevantArticleId\":69880,\"wgRequestId\":\"WZ-6VwpAIDkAAG8q5zgAAACJ\",\"wgIsProbablyEditable\":false,\"wgRelevantPageIsProbablyEditable\":false,\"wgRestrictionEdit\":[\"autoconfirmed\"],\"wgRestrictionMove\":[\"sysop\"],\"wgFlaggedRevsParams\":{\"tags\":{}},\"wgStableRevisionId\":null,\"wgWikiEditorEnabledModules\":{\"toolbar\":true,\"preview\":false,\"publish\":false},\"wgBetaFeaturesFeatures\":[],\"wgMediaViewerOnClick\":true,\"wgMediaViewerEnabledByDefault\":false,\"wgPopupsShouldSendModuleToUser\":false,\"wgPopupsConflictsWithNavPopupGadget\":false,\"wgVisualEditor\":{\"pageLanguageCode\":\"en\",\"pageLanguageDir\":\"ltr\",\"pageVariantFallbacks\":\"en\",\"usePageImages\":true,\"usePageDescriptions\":true},\"wgPreferredVariant\":\"en\",\"wgMFExpandAllSectionsUserOption\":false,\"wgMFDisplayWikibaseDescriptions\":{\"search\":true,\"nearby\":true,\"watchlist\":true,\"tagline\":false},\"wgRelatedArticles\":null,\"wgRelatedArticlesUseCirrusSearch\":true,\"wgRelatedArticlesOnlyUseCirrusSearch\":false,\"wgULSCurrentAutonym\":\"English\",\"wgNoticeProject\":\"wikipedia\",\"wgCentralNoticeCookiesToDelete\":[],\"wgCentralNoticeCategoriesUsingLegacy\":[\"Fundraising\",\"fundraising\"],\"wgCategoryTreePageCategoryOptions\":\"{\\\"mode\\\":0,\\\"hideprefix\\\":20,\\\"showcount\\\":true,\\\"namespaces\\\":false}\
",\"wgWikibaseItemId\":\"Q517\",\"wgCentralAuthMobileDomain\":false,\"wgVisualEditorToolbarScrollOffset\":0,\"wgVisualEditorUnsupportedEditParams\":[\"preload\",\"preloadparams\",\"preloadtitle\",\"undo\",\"undoafter\",\"veswitched\"],\"wgEditSubmitButtonLabelPublish\":false});mw.loader.state({\"ext.gadget.charinsert-styles\":\"ready\",\"ext.globalCssJs.user.styles\":\"ready\",\"ext.globalCssJs.site.styles\":\"ready\",\"site.styles\":\"ready\",\"noscript\":\"ready\",\"user.styles\":\"ready\",\"user\":\"ready\",\"user.options\":\"loading\",\"user.tokens\":\"loading\",\"ext.math.styles\":\"ready\",\"ext.cite.styles\":\"ready\",\"wikibase.client.init\":\"ready\",\"ext.visualEditor.desktopArticleTarget.noscript\":\"ready\",\"ext.uls.interlanguage\":\"ready\",\"ext.wikimediaBadges\":\"ready\",\"mediawiki.legacy.shared\":\"ready\",\"mediawiki.legacy.commonPrint\":\"ready\",\"mediawiki.sectionAnchor\":\"ready\",\"mediawiki.skinning.interface\":\"ready\",\"skins.vector.styles\":\"ready\",\"ext.globalCssJs.user\":\"ready\",\"ext.globalCssJs.site\":\"ready\"});mw.loader.implement(\"user.options@0bhc5ha\",function($,jQuery,require,module){mw.user.options.set([]);});mw.loader.implement(\"user.tokens@1dqfd7l\",function ( $, jQuery, require, module ) 
{\nmw.user.tokens.set({\"editToken\":\"+\\\\\",\"patrolToken\":\"+\\\\\",\"watchToken\":\"+\\\\\",\"csrfToken\":\"+\\\\\"});/*@nomin*/\n\n});mw.loader.load([\"ext.math.scripts\",\"ext.cite.a11y\",\"site\",\"mediawiki.page.startup\",\"mediawiki.user\",\"mediawiki.hidpi\",\"mediawiki.page.ready\",\"mediawiki.toc\",\"mediawiki.searchSuggest\",\"ext.gadget.teahouse\",\"ext.gadget.ReferenceTooltips\",\"ext.gadget.watchlist-notice\",\"ext.gadget.DRN-wizard\",\"ext.gadget.charinsert\",\"ext.gadget.refToolbar\",\"ext.gadget.extra-toolbar-buttons\",\"ext.gadget.switcher\",\"ext.gadget.featured-articles-links\",\"ext.centralauth.centralautologin\",\"mmv.head\",\"mmv.bootstrap.autostart\",\"ext.visualEditor.desktopArticleTarget.init\",\"ext.visualEditor.targetLoader\",\"ext.eventLogging.subscriber\",\"ext.wikimediaEvents\",\"ext.navigationTiming\",\"ext.uls.eventlogger\",\"ext.uls.init\",\"ext.uls.interface\",\"ext.centralNotice.geoIP\",\"ext.centralNotice.startUp\",\"skins.vector.js\"]);});">]>, #<Nokogiri::XML::Text:0x3fd0694e18c0 "\n">...# hundreds of more lines of response...

Whoa. Not only is that a lot of information, it is also extremely difficult to read (for humans, at least). There is quite a bit we can do with this document but again, in order to keep it concise, we will focus on using the .css and .text methods to make it a little more readable for us humans.

.css Method

We are going to use the .css method to help search that giant document. The .css method takes a CSS selector as an argument and returns every element that matches it. The article on Napoleon is pretty long and we don’t want to be overwhelmed with too much information right off the bat. Let’s keep it simple and return the name of each subsection from the article as listed in the table of contents:

Contents

Let’s use the DevTools element selector to see which element we will need to grab in order to map the list of contents:

Selecting the selector

Nice! Looks like the contents can be grabbed by using the span.toctext selector. Let’s edit our method in order to grab the information:

def sample
  url = "https://en.wikipedia.org/wiki/Napoleon"
  doc = Nokogiri::HTML(URI.open(url))
  # Select every <span class="toctext"> in the document
  p doc.css("span.toctext")
end

Invoke this method at the end of the file, then run the file in your terminal. You should get a response that looks like this:

[#<Nokogiri::XML::Element:0x3fc0c145fcec name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c145fabc name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c14539d8 "Origins and education">]>, #<Nokogiri::XML::Element:0x3fc0c18417fc name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c184166c name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c1840b18 "Early career">]>, #<Nokogiri::XML::Element:0x3fc0c1447930 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c1447804 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c1446c24 "Siege of Toulon">]>, #<Nokogiri::XML::Element:0x3fc0c1431c5c name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c1430d48 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c1891b08 "13 Vendémiaire">]>, #<Nokogiri::XML::Element:0x3fc0c189021c name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c18912fc name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c1880970 "First Italian campaign">]>, #<Nokogiri::XML::Element:0x3fc0c1879e54 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c1879d78 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c1456340 "Egyptian expedition">]>, #<Nokogiri::XML::Element:0x3fc0c1869c20 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c1869b80 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c1861818 "Ruler of France">]>, #<Nokogiri::XML::Element:0x3fc0c1860648 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c18602d8 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c1858650 "French Consulate">]>, #<Nokogiri::XML::Element:0x3fc0c18515e4 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c185101c name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c14bb8bc "Temporary peace in Europe">]>, #<Nokogiri::XML::Element:0x3fc0c14baffc name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c14bae44 name="class" value="toctext">] 
children=[#<Nokogiri::XML::Text:0x3fc0c14ba6c4 "French Empire">]>, #<Nokogiri::XML::Element:0x3fc0c18bda8c name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c18bd758 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c18bc7a4 "War of the Third Coalition">]>, #<Nokogiri::XML::Element:0x3fc0c18bc7b8 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c18bd1f4 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c14aff94 "Middle-Eastern alliances">]>, #<Nokogiri::XML::Element:0x3fc0c14ae824 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c14ae5a4 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c14a7114 "War of the Fourth Coalition and Tilsit">]>, #<Nokogiri::XML::Element:0x3fc0c14a7a60 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c149fd24 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c14d3a34 "Peninsular War and Erfurt">]>, #<Nokogiri::XML::Element:0x3fc0c14d387c name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c14d37a0 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c14d2724 "War of the Fifth Coalition and Marie Louise">]>, #<Nokogiri::XML::Element:0x3fc0c14d2f08 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c14cbf00 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c14caf4c "Invasion of Russia">]>, #<Nokogiri::XML::Element:0x3fc0c14ca0d8 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c14ca344 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c14c3328 "War of the Sixth Coalition">]>, #<Nokogiri::XML::Element:0x3fc0c14c3008 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c14c2f68 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c14c2824 "Exile to Elba">]>, #<Nokogiri::XML::Element:0x3fc0c14c2680 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c14c261c name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c14c2144 "Hundred Days">]>, 
#<Nokogiri::XML::Element:0x3fc0c14f5ee0 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c14f5e54 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c14f5710 "Exile on Saint Helena">]>, #<Nokogiri::XML::Element:0x3fc0c14f538c name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c14f5224 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c14f47fc "Death">]>, #<Nokogiri::XML::Element:0x3fc0c14f4478 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c14f43c4 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c14ef8b0 "Cause of death">]>, #<Nokogiri::XML::Element:0x3fc0c14ef2c0 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c14ef144 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c14e3d1c "Religion">]>, #<Nokogiri::XML::Element:0x3fc0c14e3830 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c14e3678 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c14dbc5c "Concordat">]>, #<Nokogiri::XML::Element:0x3fc0c14db824 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c14db734 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c14dac08 "Religious emancipation">]>, #<Nokogiri::XML::Element:0x3fc0c14da17c name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c14daadc name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c18c9a30 "Personality">]>, #<Nokogiri::XML::Element:0x3fc0c18c9850 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c18c97d8 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c18c9314 "Image">]>, #<Nokogiri::XML::Element:0x3fc0c18c90f8 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c18c9080 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c18c8b08 "Reforms">]>, #<Nokogiri::XML::Element:0x3fc0c18c8860 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c18c87c0 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c18c81a8 "Napoleonic Code">]>, 
#<Nokogiri::XML::Element:0x3fc0c18c5f5c name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c18c5ebc name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c18c5778 "Warfare">]>, #<Nokogiri::XML::Element:0x3fc0c18c52f0 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c18c5278 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c18c43b4 "Metric system">]>, #<Nokogiri::XML::Element:0x3fc0c18c4198 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c18c4120 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c14fda78 "Education">]>, #<Nokogiri::XML::Element:0x3fc0c14fd76c name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c14fd708 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c14fcb14 "Memory and evaluation">]>, #<Nokogiri::XML::Element:0x3fc0c14fc8e4 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c14fc858 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c14fc18c "Criticism">]>, #<Nokogiri::XML::Element:0x3fc0c18d5ee8 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c18d5e84 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c18d59ac "Propaganda and memory">]>, #<Nokogiri::XML::Element:0x3fc0c18d577c name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c18d5718 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c18d522c "Long-term influence outside France">]>, #<Nokogiri::XML::Element:0x3fc0c18d4fc0 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c18d4f48 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c18d4a84 "Marriages and children">]>, #<Nokogiri::XML::Element:0x3fc0c18d48cc name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c18d4854 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c18d4368 "Titles, styles, honours, and arms">]>, #<Nokogiri::XML::Element:0x3fc0c18d4188 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c18d4110 name="class" value="toctext">] 
children=[#<Nokogiri::XML::Text:0x3fc0c1509be8 "Ancestry">]>, #<Nokogiri::XML::Element:0x3fc0c150997c name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c15098c8 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c150915c "See also">]>, #<Nokogiri::XML::Element:0x3fc0c1508d74 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c1508cac name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c1508540 "Notes">]>, #<Nokogiri::XML::Element:0x3fc0c1508360 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c15082fc name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c1519cf0 "Citations">]>, #<Nokogiri::XML::Element:0x3fc0c1519ac0 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c1519a5c name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c15193cc "References">]>, #<Nokogiri::XML::Element:0x3fc0c1519200 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c1519188 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c1518cc4 "Biographical studies">]>, #<Nokogiri::XML::Element:0x3fc0c1518aa8 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c1518a44 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c15185a8 "Specialty studies">]>, #<Nokogiri::XML::Element:0x3fc0c15183c8 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c1518364 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c1511e24 "Historiography and memory">]>, #<Nokogiri::XML::Element:0x3fc0c1511c44 name="span" attributes=[#<Nokogiri::XML::Attr:0x3fc0c1511bb8 name="class" value="toctext">] children=[#<Nokogiri::XML::Text:0x3fc0c1511690 "External links">]>]

Ugh! Still not easy for us to parse. Luckily there's another method we can use to help clean this up: .text.

.text method

Our CSS selector is returning a NodeSet: an array-like collection of nodes, each carrying the element's name, attributes and children. By chaining on the .text method, we can extract only the text from those nodes. Let’s edit the code once more:

def sample
  url = "https://en.wikipedia.org/wiki/Napoleon"
  doc = Nokogiri::HTML(URI.open(url))
  # .text on the whole NodeSet returns one concatenated string
  p doc.css("span.toctext").text
end

Now our response looks more like this:

"Origins and educationEarly careerSiege of Toulon13 VendémiaireFirst Italian campaignEgyptian expeditionRuler of FranceFrench ConsulateTemporary peace in EuropeFrench EmpireWar of the Third CoalitionMiddle-Eastern alliancesWar of the Fourth Coalition and TilsitPeninsular War and ErfurtWar of the Fifth Coalition and Marie LouiseInvasion of RussiaWar of the Sixth CoalitionExile to ElbaHundred DaysExile on Saint HelenaDeathCause of deathReligionConcordatReligious emancipationPersonalityImageReformsNapoleonic CodeWarfareMetric systemEducationMemory and evaluationCriticismPropaganda and memoryLong-term influence outside FranceMarriages and childrenTitles, styles, honours, and armsAncestrySee alsoNotesCitationsReferencesBiographical studiesSpecialty studiesHistoriography and memoryExternal links"

Wow. That is much better. But it is still a little sloppy: calling .text on the whole collection runs every span's text together into one string. How can we tell the spans apart? Well, since the NodeSet returned by .css behaves like an array, let's just map over it. We are still using Ruby, after all:

def sample
  url = "https://en.wikipedia.org/wiki/Napoleon"
  doc = Nokogiri::HTML(URI.open(url))
  # Call .text on each span separately to get an array of strings
  p doc.css("span.toctext").map { |span| span.text }
end

Run the file again and here is what we get back:

["Origins and education", "Early career", "Siege of Toulon", "13 Vendémiaire", "First Italian campaign", "Egyptian expedition", "Ruler of France", "French Consulate", "Temporary peace in Europe", "French Empire", "War of the Third Coalition", "Middle-Eastern alliances", "War of the Fourth Coalition and Tilsit", "Peninsular War and Erfurt", "War of the Fifth Coalition and Marie Louise", "Invasion of Russia", "War of the Sixth Coalition", "Exile to Elba", "Hundred Days", "Exile on Saint Helena", "Death", "Cause of death", "Religion", "Concordat", "Religious emancipation", "Personality", "Image", "Reforms", "Napoleonic Code", "Warfare", "Metric system", "Education", "Memory and evaluation", "Criticism", "Propaganda and memory", "Long-term influence outside France", "Marriages and children", "Titles, styles, honours, and arms", "Ancestry", "See also", "Notes", "Citations", "References", "Biographical studies", "Specialty studies", "Historiography and memory", "External links"]

Much better!

Conclusion

Now that we have a better understanding of how to use Nokogiri and Open-URI, go ahead and experiment a bit on your own. It takes some practice, but all you really need is the .css and .text methods (as well as any other Ruby methods) to get started. Try creating a method of your own to see how many times the word “Napoleon” is mentioned, or how many hyperlinks there are, or anything else you think might be worth exploring.

Be sure to check back in the future for Part 2 of this post, where we will use some Object-Oriented Programming techniques with the information we scraped to build something neat.
