
Saturday, April 30, 2016

Search Engine Friendly URLs (SEF)


Clean URLs (or search engine friendly URLs) are just that – clean, easy to read, simple.
You do not need clean URLs in site architecture for Google to spider a site successfully (confirmed by Google in 2008), although I do use clean URLs as a default these days, and have done so for years.
It’s often more usable.
Is there a massive difference in Google when you use clean URLs?
No, in my experience it’s very much a second or third order effect, perhaps even less, if used on its own. However, there is a demonstrable benefit to having keywords in URLs.
The thinking is that you might get a boost in Google SERPs if your URLs are clean – because you are using keywords in the actual page name instead of a parameter or session ID number (which Google often struggles with).
I think Google might reward the page some sort of relevance because of the actual file / page name. I optimise as if they do.
It is virtually impossible to isolate any ranking factor with a degree of certainty.
Where any benefit is slightly detectable is when people (say in forums) link to your site with the URL as the link.
Then it is fair to say you do get a boost because keywords are in the actual anchor text link to your site, and I believe this is the case, but again, that depends on the quality of the page linking to your site. That is, if Google trusts it and it passes Pagerank (!) and anchor text benefit.
And of course, you’ll need citable content on that site of yours.
Sometimes I will remove the stop-words from a URL and leave the important keywords in the page name because a lot of forums garble a URL to shorten it. Most forum links will be nofollowed in 2016, to be fair, but some old habits die hard.
Sometimes I prefer to see the exact phrase I am targeting as the name of the URL I am asking Google to rank.
I configure URLs the following way;
  1. www.hobo-web.co.uk/?p=292 — is automatically changed by the CMS using URL rewrite to
  2. www.hobo-web.co.uk/websites-clean-search-engine-friendly-URLs/ — which I then break down to something like
  3. www.hobo-web.co.uk/search-engine-friendly-URLs/
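Exactly how the rewrite happens depends on your CMS and server (a WordPress-style CMS will normally handle step 1 to 2 for you through its permalink settings). As a rough sketch only – assuming an Apache server with mod_rewrite, and not a copy of my own setup – a 301 redirect from the old dynamic URL to the clean one might look something like this:

 # .htaccess sketch – assumes Apache with mod_rewrite enabled
 RewriteEngine On
 # Send the old dynamic URL (/?p=292) to the clean, keyword-based URL with a 301
 RewriteCond %{QUERY_STRING} ^p=292$
 RewriteRule ^$ /search-engine-friendly-URLs/? [R=301,L]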
It should be remembered that although Googlebot can crawl sites with dynamic URLs, many webmasters assume there is a greater risk it will give up if the URLs are deemed unimportant and contain multiple variables and session IDs (this is theory, not fact).
As standard, I use clean URLs where possible on new sites these days, and try to keep the URLs as simple as possible and do not obsess about it.
That’s my aim at all times when I optimise a website to work better in Google – simplicity.
Google does look at keywords in the URL, even at a granular level.
Having a keyword in your URL might be the difference between your site ranking and not – potentially useful to take advantage of long tail search queries.

Article published on http://www.hobo-web.co.uk/

Link Title Attributes, Acronym & ABBR Tags


Does Google Count Text in The Acronym Tag?
From my tests, no. From observing how my test page ranks – Google is ignoring keywords in the acronym tag.
My observations from that test page include;
  • Link Title Attribute – no benefit passed via the link either to another page, it seems
  • ABBR (Abbreviation Tags) – No
  • Image File Name – No
  • Wrapping words (or at least numbers) in SCRIPT – Sometimes. Google is better at understanding what it can render in 2016.
It’s clear many invisible elements of a page are completely ignored by Google (that would interest us SEOs).
Some invisible items are (still) apparently supported:
  • NOFRAMES – Yes
  • NOSCRIPT – Yes
  • ALT Attribute – Yes
Unless you really have cause to focus on any particular invisible element, I think the **P** tag is the most important tag to optimise in 2016.
Article published on http://www.hobo-web.co.uk/

Alt Tags


NOTE: Alt Tags are counted by Google (and Bing), but I would be careful about over-optimising them. I’ve seen a lot of websites penalised for over-optimising invisible elements on a page. Don’t do it.
ALT tags are very important and I think a very rewarding area to get right. I always put the main keyword in an ALT once when addressing a page.
Don’t optimise your ALT tags (or rather, attributes) JUST for Google!
Use ALT tags (or rather, ALT Attributes) for descriptive text that helps visitors – and keep them unique where possible, like you do with your titles and meta descriptions.
Don’t obsess. Don’t optimise your ALT tags just for Google – do it for humans, accessibility and usability. If you are interested, I conducted a simple test using ALT attributes to determine how many words I could use in IMAGE ALT text that Google would pick up.
And remember – even if, like me most days, you can’t be bothered with all the image ALT tags on your page, at least, use a blank ALT (or NULL value) so people with screen readers can enjoy your page.
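To make that concrete, here is a minimal, hypothetical example (the file names are made up): a descriptive ALT on an image that carries meaning, and an empty (NULL) ALT on a purely decorative one so screen readers skip it:

<!-- Descriptive ALT for an image that carries meaning -->
<img src="blue-widget.jpg" alt="Blue widget with chrome trim">
<!-- Empty (NULL) ALT for a purely decorative image -->
<img src="divider.png" alt="">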
Update 17/11/08 – Picked This Up At SERoundtable about Alt Tags:
JohnMu from Google: alt attribute should be used to describe the image. So if you have an image of a big blue pineapple chair you should use the alt tag that best describes it, which is alt=”big blue pineapple chair.” title attribute should be used when the image is a hyperlink to a specific page. The title attribute should contain information about what will happen when you click on the image. For example, if the image will get larger, it should read something like, title=”View a larger version of the big blue pineapple chair image.”
Barry continues with a quote:
As the Googlebot does not see the images directly, we generally concentrate on the information provided in the “alt” attribute. Feel free to supplement the “alt” attribute with “title” and other attributes if they provide value to your users! So for example, if you have an image of a puppy (these seem popular at the moment ) playing with a ball, you could use something like “My puppy Betsy playing with a bowling ball” as the alt-attribute for the image. If you also have a link around the image, pointing a large version of the same photo, you could use “View this image in high-resolution” as the title attribute for the link.
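Translated into markup, the advice in that quote might look something like this (file names and the link target are invented for illustration):

<a href="puppy-large.jpg" title="View this image in high-resolution">
  <img src="puppy-small.jpg" alt="My puppy Betsy playing with a bowling ball">
</a>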

Article published on http://www.hobo-web.co.uk/

Friday, April 29, 2016

H1-H6: Page Headings


I can’t find any definitive proof online that says you need to use Heading Tags (H1, H2, H3, H4, H5, H6) or that they improve rankings in Google, and I have seen pages do well in Google without them – but I do use them, especially the H1 tag on the page.
For me, it’s another piece of a ‘perfect’ page, in the traditional sense, and I try to build a site for Google and humans.
<h1>This is a page title</h1>
I still generally only use one <h1> heading tag in my keyword targeted pages – I believe this is the way the W3C intended it to be used in HTML4 – and I ensure they are at the top of a page above relevant page text and written with my main keywords or related keyword phrases incorporated.
I have never experienced any problems using CSS to control the appearance of the heading tags making them larger or smaller.
You can use multiple H1s in HTML5, but most sites I work on still use HTML4.
I use as many H2 – H6 tags as necessary, depending on the size of the page, but mostly I stick to H1, H2 & H3. You can see here how to use header tags properly (basically, just be consistent, whatever you do, to give your users the best user experience).
How many words in the H1 Tag? As many as I think is sensible – as short and snappy as possible usually.
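As a simple sketch of what I mean (an assumed page fragment, not a template): one H1 at the top above the relevant text, then H2s and H3s for sub-sections, with plain CSS controlling how big the headings look:

<style>
  /* Appearance is controlled with CSS – resizing headings has never caused me problems */
  h1 { font-size: 1.6em; }
  h2 { font-size: 1.2em; }
</style>
<h1>Search Engine Friendly URLs</h1>
<p>Relevant introductory text...</p>
<h2>Why clean URLs help usability</h2>
<p>...</p>
<h3>Keywords in the URL</h3>
<p>...</p>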
I also discovered Google will use your Header tags as page titles at some level if your title element is malformed.
As always be sure to make your heading tags highly relevant to the content on that page and not too spammy, either.
Article published on http://www.hobo-web.co.uk/

Robots.txt File


Why shouldn’t I edit it with my Dreamweaver FTP client, for instance?
Because all those fancy apps insert useless crap like formatting, HTML code and whatnot. Most probably search engines aren’t capable of interpreting a robots.txt file like:
DOCTYPE text/plain PUBLIC 
"-//W3C//DTD TEXT 1.0 Transitional//Swahili" 
"http://www.w3.org/TR/text/DTD/plain1-transitional.dtd"> 
{\b\lang2057\langfe1031\langnp2057\insrsid6911344\charrsid11089941 
User-agent: Googlebot}
{ \lang2057\langfe1031\langnp2057\insrsid6911344\charrsid11089941 \line 
Disallow: / \line Allow: }{\cs15\i\lang2057\langfe1031\langnp2057\insrsid6911344\charrsid2903095
{\i\lang2057\langfe1031\langnp2057\insrsid6911344\charrsid2903095 content}{ \cs15\i\lang2057\langfe1031\langnp2057\insrsid6911344\charrsid2903095 /} ...

(Ok Ok, I’ve made up this example, but it represents the raw contents of text files saved with HTML editors and word processors.)
Where do I put robots.txt?
Robots.txt resides in the root directory of your Web space, that’s either a domain or a subdomain, for example
"/web/user/htdocs/example.com/robots.txt"
resolving to
http://example.com/robots.txt.
Can I use Robots.txt in sub directories?
Of course you’re free to create robots.txt files in all your subdirectories, but you shouldn’t expect search engines to request or obey those. If, for some weird reason, you use subdomains like crap.example.com, then example.com/robots.txt is not a suitable instrument to steer crawling of subdomains, hence ensure each subdomain serves its own robots.txt.

When you upload your robots.txt, make sure to do it in ASCII mode; your FTP client usually offers “ASCII|Auto|Binary” – choose “ASCII” even when you’ve used an ANSI editor to create it.
Why?
Because plain text files contain ASCII content only. Sometimes standards that say “upload *.htm *.php *.txt .htaccess *.xml files in ASCII mode to prevent them from inadvertent corruption during transfer, storing with invalid EOL codes, etc.” do make sense. (You asked for the idiot version, didn’t you?)
What about if I am on a Free Host?
If you’re on a free host, robots.txt is not for you. Your hosting service will create a read-only robots.txt “file” that’s suitable to steal even more traffic than its ads that you can’t remove from your headers and footers. Now, if you’re still interested in the topic, you must learn how search engines work to understand what you can achieve with a robots.txt file and what’s just a myth posted on your favorite forum.
Sebastian, Do you know how search engines work, then?
Yep, to some degree. ;) Basically, a search engine has three major components: A crawler that burns your bandwidth fetching your unchanged files over and over until you’re belly up. An indexer that buries your stuff unless you’re Matt Cutts or blog on a server that gained search engine love making use of the cruelest black hat tactics you can think of. A query engine that accepts search queries and pulls results from the search index but ignores your stuff coz you’re neither me nor Matt Cutts.
What goes into the robots.txt file?
Your robots.txt file contains useful but pretty much ignored statements like
 # Please don't crawl this site during our business hours!
(the crawler is not aware of your time zone and doesn’t grab your office hours from your site), as well as actual crawler directives. In other words, everything you write in your robots.txt is a directive for crawlers (dumb Web robots that can fetch your contents but nothing more), not indexers (highly sophisticated algorithms that rank only brain farts from Matt and me).
Currently, there are only three statements you can use in robots.txt:
Disallow: /path
Allow: /path
Sitemap: http://example.com/sitemap.xml
Some search engines support other directives like “crawl-delay”, but that’s utter nonsense, hence safely ignore those.
The content of a robots.txt file consists of sections dedicated to particular crawlers. If you’ve nothing to hide, then your robots.txt file looks like:

 User-agent: *
 Disallow:
 Allow: /
 Sitemap: http://example.com/sitemap.xml 

If you’re comfortable with Google but MSN scares you, then write:

 User-agent: *
 Disallow:

 User-agent: Googlebot
 Disallow:

 User-agent: msnbot
 Disallow: /

Please note that you must terminate every crawler section with an empty line. You can gather the names of crawlers by visiting a search engine’s Webmaster section.
From the examples above you’ve learned that each search engine has its own section (at least if you want to hide anything from a particular SE), that each section starts with a

 User-agent: [crawler name]
line, and that each section is terminated with a blank line. The user agent name “*” stands for the universal Web robot; that means that if your robots.txt lacks a section for a particular crawler, it will use the “*” directives, and when you have a section for a particular crawler, it will ignore the “*” section. In other words, if you create a section for a crawler, you must duplicate all statements from the “all crawlers” (“User-agent: *”) section before you edit that crawler’s section.
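For example (a made-up illustration): if the “*” section disallows /private/ and you then add a Googlebot section, Googlebot reads only its own section, so the /private/ rule has to be repeated there too:

 User-agent: *
 Disallow: /private/

 User-agent: Googlebot
 Disallow: /private/
 Disallow: /testing/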
Now to the directives. The most important crawler directive is

 Disallow: /path
“Disallow” means that a crawler must not fetch contents from URIs that match “/path”. “/path” is either a relative URI or a URI pattern (“*” matches any string and “$” marks the end of a URI). Not all search engines support wildcards; for example, MSN lacks any wildcard support (they might grow up some day).
URIs are always relative to the Web space’s root, so if you copy and paste URLs then remove the http://example.com part but not the leading slash.

 Allow: /path
refines Disallow: statements, for example
 User-agent: Googlebot 
 Disallow: / 
 Allow: /content/
allows crawling only within http://example.com/content/
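As a hypothetical example of those patterns, for crawlers that do support them (Googlebot does; check each engine’s documentation before relying on this):

 User-agent: Googlebot
 # "*" matches any string, "$" marks the end of the URI
 Disallow: /*?sessionid=
 Disallow: /*.pdf$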

Sitemap: http://example.com/sitemap.xml
points search engines that support the sitemaps protocol to the submission files.
Please note that all robots.txt directives are crawler directives that don’t affect indexing. Search engines do index disallow’ed URLs pulling title and snippet from foreign sources, for example ODP (DMOZ – The Open Directory) listings or the Yahoo directory. Some search engines provide a method to remove disallow’ed contents from their SERPs on request.
Say I want to keep a file / folder out of Google. Exactly what would I need to do?
You’d check each HTTP request for Googlebot and serve it a 403 or 410 HTTP response code. Or put a “noindex,noarchive” Googlebot meta tag.
(<meta name="googlebot" content="noindex,noarchive" />). Robots.txt blocks with Disallow: don’t prevent indexing. Don’t block crawling of pages that you want to have deindexed, unless you want to use Google’s robots.txt based URL terminator every six months.
If someone wants to know more about robots.txt, where do they go?
Honestly, I don’t know a better resource than my brain, partly dumped here. I even developed a few new robots.txt directives and posted a request for comments a few days ago. I hope that Google, the one and only search engine that seriously invests in REP evolvements, will not ignore this post caused by the sneakily embedded “Google bashing”. I plan to write a few more posts, not that technical and with real world examples.
Can I ask you how you auto generate and mask robots.txt, or is that not for idiots? Is that even ethical?
Of course you can ask, and yes, it’s for everybody and 100% ethical. It’s a very simple task; in fact, it’s plain cloaking. The trick is to make the robots.txt file a server-side script. Then check all requests for verified crawlers and serve the right contents to each search engine. A smart robots.txt even maintains crawler IP lists and stores raw data for reports. I recently wrote a manual on cloaked robots.txt files on request of a loyal reader.
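A very rough sketch of the idea – much cruder than the verified-crawler, IP-checking script described above – using nothing but Apache mod_rewrite and user-agent matching (file names are invented, and user-agent strings can be spoofed, so this is illustration only):

 # .htaccess sketch – serve a crawler-specific file when Googlebot asks for robots.txt
 RewriteEngine On
 RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
 RewriteRule ^robots\.txt$ /robots-googlebot.txt [L]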
Think Disney will come after you for your avatar now you are famous after being interviewed on the Hobo blog?
I’m sure they will try it, since your blog will become an authority on grumpy red crabs called Sebastian. I’m not too afraid though, because I use only a tiny thumbnailed version of an image created by a designer who –hopefully– didn’t scrape it from Disney, as icon/avatar. If they become nasty, I’ll just pay a license fee and change my avatar on all social media sites, but I doubt that’s necessary. To avoid such hassles I bought an individually drawn red crab from an awesome cartoonist last year. That’s what you see on my blog, and I use it as avatar as well, at least with new profiles.
Who do you work for?
I’m a freelancer loosely affiliated with a company that sells IT consulting services in several industries. I do Web developer training, software design / engineering (mostly the architectural tasks), and grab development / (technical) SEO projects myself to educate yours truly. I’m a dad of three little monsters, working at home. If you want to hire me, drop me a line. ;)
Sebastian, a big thanks for slapping me about over Robots.txt, and indeed for helping me craft the Idiot’s Guide To Robots.txt. I certainly learned a lot from talking to you for a day, and I hope some others can learn from this snippet article. You’re a gentleman spammer. :)
If you enjoyed this step by step guide for beginners – you can take your knowledge to the next level at http://sebastians-pamphlets.com/
What Google says about Robots.txt files
A robots.txt file restricts access to your site by search engine robots that crawl the web. These bots are automated, and before they access pages of a site, they check to see if a robots.txt file exists that prevents them from accessing certain pages. (All respectable robots will respect the directives in a robots.txt file, although some may interpret them differently. However, a robots.txt is not enforceable, and some spammers and other troublemakers may ignore it. For this reason, we recommend password protecting confidential information.) To see which URLs Google has been blocked from crawling, visit the Blocked URLs page of the Health section of Webmaster Tools.

You need a robots.txt file only if your site includes content that you don’t want search engines to index. If you want search engines to index everything in your site, you don’t need a robots.txt file (not even an empty one). While Google won’t crawl or index the content of pages blocked by robots.txt, we may still index the URLs if we find them on other pages on the web. As a result, the URL of the page and, potentially, other publicly available information such as anchor text in links to the site, or the title from the Open Directory Project (www.dmoz.org), can appear in Google search results.
Article published on http://www.hobo-web.co.uk/

Robots Meta Tag


Example Robots Meta Tag;
<meta name="robots" content="index, nofollow" />
I could use the above meta tag to tell Google to index the page but not to follow any links on the page. If, for some reason, I did not want the page to appear in Google search results at all, I would use a NOINDEX value instead.
By default, Googlebot will index a page and follow links to it. So there’s no need to tag pages with content values of INDEX or FOLLOW. GOOGLE
There are various instructions you can make use of in your Robots Meta Tag, but remember Google by default WILL index and follow links, so you have NO need to include that as a command – you can leave the robots meta out completely – and probably should if you don’t have a clue.
Googlebot understands any combination of lowercase and uppercase. GOOGLE.
Valid values for the Robots Meta Tag “CONTENT” attribute are: “INDEX“, “NOINDEX“, “FOLLOW“, and “NOFOLLOW“.
Example Usage:
  • <META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
  • <META NAME="ROBOTS" CONTENT="INDEX, NOFOLLOW">
  • <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
  • <META NAME="ROBOTS" CONTENT="NOARCHIVE">
  • <META NAME="GOOGLEBOT" CONTENT="NOSNIPPET">
Google will understand the following and interprets the following robots meta tag values:
  • NOINDEX – prevents the page from being included in the index.
  • NOFOLLOW – prevents Googlebot from following any links on the page. (Note that this is different from the link-level NOFOLLOW attribute, which prevents Googlebot from following an individual link.)
  • NOARCHIVE – prevents a cached copy of this page from being available in the search results.
  • NOSNIPPET – prevents a description from appearing below the page in the search results, as well as prevents caching of the page.
  • NOODP – blocks the Open Directory Project description of the page from being used in the description that appears below the page in the search results.
  • NONE – equivalent to “NOINDEX, NOFOLLOW”.
Robots META Tag Quick Reference
Term                Googlebot   Slurp   BingBot   Teoma
NoIndex             YES         YES     YES       YES
NoFollow            YES         YES     YES       YES
NoArchive           YES         YES     YES       YES
NoSnippet           YES         NO      NO        NO
NoODP               YES         YES     YES       NO
NoYDIR              NO          YES     NO        NO
NoImageIndex        YES         NO      NO        NO
NoTranslate         YES         NO      NO        NO
Unavailable_After   YES         NO      NO        NO
I’ve included the robots meta tag in my tutorial as this is one of only a few meta tags / HTML head elements I focus on when it comes to managing Googlebot and Bingbot. At a page level, it is a powerful way to control whether your pages are returned in search results.
These meta tags go in the <head> section of an HTML page and represent the only tags for Google I care about. Just about everything else you can put in the <head> of your HTML document is quite unnecessary and maybe even pointless (for Google optimisation, anyway).
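In context, a minimal (made-up) document head with the robots meta tag in place would look something like this:

<head>
  <title>Example page title</title>
  <meta name="description" content="Example meta description.">
  <!-- Keep this page out of search results but let Googlebot follow its links -->
  <meta name="robots" content="noindex, follow">
</head>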
Article published on http://www.hobo-web.co.uk/