(Incidentally, USTC's "Jiajia" robot subsequently had a preprogrammed response that accidentally gave an English lesson instead of the scriptwriters' philosophy: "The question 'why do you think the meaning of life is' cannot be answered; 'why' is an improper word to use". Indeed: the proper word "what" had been shortened in the interviewer's speech, so mishearing it as "why" and telling her off was a semi-believable story. But I'm digressing.)
Baidu itself took second place with a similar "scrape some sentences and auto-translate" page, once again mistranslating "partially sighted" as "no future prospects" and completely missing my own translation. At this point I gave up on the idea of emailing in corrections, as it was obvious I wasn't going to figure out who was actually responsible for the machine-translation technology in use.
After that the results lost relevance, and my real site was nowhere to be seen.
Anyway, the desktop site was not showing this "English results" option. Or if it was, it wasn't making it obvious enough for me to find. Was that option pioneered by their mobile team and not yet fully integrated into their older desktop interface? Who knows.
However, I could find the Chinese version of my Gradint page on Baidu's Chinese results, and I could also find the English version (but not the Chinese version) of my Xu Zhimo page (which does contain some Chinese text) on Baidu's Chinese results. Why had just these pages been listed?
A Baidu search for the Gradint page's URL found a mention of it on a scraped copy of a Weibo forum, on which someone had once asked for recommendations of free websites for learning English. At that time I was still using my Weibo account, and Weibo placed this question in my inbox, so I thought the questioner was asking me personally (I realised later that Weibo had an algorithm for placing public questions into inboxes; the human questioner had not singled me out as I'd thought). I'd replied saying I might not be the best person to ask because, as a native speaker, I hadn't learned English online myself and didn't know which sites were blocked in their country, but they could try howjsay, Forvo, the BBC or my Gradint, and I invited the questioner to PM me to discuss their specific interests to see if I could suggest other appropriate material. (They didn't PM me, and now that my Weibo account has been stolen it's too late.) It now seemed this one mention of Gradint on a Weibo forum had been sufficient to get that page---and only that page---onto Baidu's Chinese index. Similarly, a few people had linked to my Xu Zhimo page from some Chinese forum sites, and those links had to go to the English version because the Chinese version wasn't yet published.
So apparently Baidu were listing Chinese-language pages one hop away from big Chinese sites, but not two hops. They had listed my Chinese Gradint page that had been mentioned on the Weibo forum, but they had not followed the Chinese Gradint page's link to my Chinese home page. And they had listed the English version of my Xu Zhimo page (calling it a Chinese result because it does contain some Chinese text), but had not followed that page's link to the version with all-Chinese notes.
Why was my Chinese Gradint page, which Baidu had correctly identified as a Chinese-language page, not regarded as carrying as much authority as a Weibo forum for giving Baidu additional Chinese-language links, such as its link to my Chinese home page? I don't know, but I guessed it might be because Baidu has a webmaster registration scheme and they might be setting their cutoff for Chinese pages at "one hop away from registered Chinese sites". So I reasoned I should try to make my Chinese home page a "registered Chinese site", or, if this wasn't possible, at least try to get a registered Chinese site elsewhere to link to all my Chinese pages, so they're all only one hop away from a registered site.
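The "one hop from trusted seeds" hypothesis can be sketched as a tiny breadth-first search over a link graph. The graph below is invented to mirror the pages described above; with a hop limit of 1 it reproduces exactly what Baidu had listed.

```python
from collections import deque

# Toy sketch of the hypothesised policy: index only pages within
# max_hops links of a set of trusted "seed" sites.  The link graph
# below is made up to mirror the situation described in the text.
links = {
    "weibo-forum":   ["gradint-zh"],
    "chinese-forum": ["xuzhimo-en"],
    "gradint-zh":    ["home-zh"],
    "xuzhimo-en":    ["xuzhimo-zh"],
}

def indexed(seeds, max_hops):
    """Return pages reachable from the seeds in at most max_hops links."""
    seen = set(seeds)
    queue = deque((s, 0) for s in seeds)
    while queue:
        page, hops = queue.popleft()
        if hops == max_hops:
            continue  # don't follow links out of pages at the hop limit
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, hops + 1))
    return seen - set(seeds)  # the seed sites themselves aren't my pages

print(sorted(indexed({"weibo-forum", "chinese-forum"}, max_hops=1)))
```

With `max_hops=1` only the Gradint page and the English Xu Zhimo page come out; raising it to 2 would have pulled in my Chinese home page and the Chinese poem page as well.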
Then it turned out Baidu don't accept registrations of user subdirectories on a university server: you have to own a whole domain. So I used my SRCF subdomain, which also had the benefit of giving me access to the Apache request logs, which, I hoped, could give more insight into what Baidu was doing. (I'm not allowed to read logs for my subdirectory on the DS server, but I can read them for my SRCF subdomain.)
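Apache's default "combined" log format is easy enough to pick apart with a short regular expression; this sketch extracts the fields I cared about (the sample line, including its IP address, is made up for illustration).

```python
import re

# Minimal parser for one line of an Apache "combined" access log:
# client IP, timestamp, request line, status, size, referer, user-agent.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

sample = ('220.181.108.1 - - [01/Jan/2018:00:00:00 +0000] '
          '"GET /robots.txt HTTP/1.1" 200 56 "-" '
          '"Mozilla/5.0 (compatible; Baiduspider/2.0; '
          '+http://www.baidu.com/search/spider.html)"')

fields = LOG_RE.match(sample).groupdict()
print(fields["ip"], fields["request"], fields["agent"])
```

Filtering the log on the `agent` and `ip` fields is all that's needed to watch what a particular crawler is doing.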
I tried a '301-redirect' from the top of my SRCF subdomain to my Chinese home page, but Baidu's registration validator said "unknown error 301" (it pretended to be Firefox 20 on Windows 7 but hadn't been programmed to understand 301-redirects). So I placed the registration-validation file on the SRCF itself, and then the registration process worked.
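What the validator apparently did was fetch the URL, see the 301 status, and give up instead of following the Location header. A self-contained sketch of what such a naive client sees, using a throwaway local server and a made-up target URL:

```python
import http.client
import http.server
import threading

# Throwaway local server answering every request with a 301 redirect,
# standing in for a subdomain root that redirects elsewhere.
class Redirect301(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(301)
        self.send_header("Location", "http://example.org/target/")
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = http.server.HTTPServer(("127.0.0.1", 0), Redirect301)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

# http.client does not follow redirects, so, like Baidu's validator
# evidently did, it just sees the bare 301 response:
conn = http.client.HTTPConnection("127.0.0.1", port)
conn.request("GET", "/")
resp = conn.getresponse()
print(resp.status, resp.getheader("Location"))
server.shutdown()
```

A redirect-aware client (e.g. `urllib.request.urlopen`) would instead transparently fetch the Location target; the validator had evidently not been given that logic.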
But I didn't want to move my actual homepage unnecessarily, so I made a simple Chinese site-map for Baidu to crawl (I generally avoided including my English-only pages in that list, because Baidu's English results didn't seem to have the "one hop away from a registered domain" restriction, and I wanted my "registered domain" to be 'considered' Chinese if they ever made statistics on it). If Baidu was going to treat my SRCF subdomain in the same way as it treated the Weibo forum, a list of links on the SRCF should be sufficient to make it index my existing pages on the DS server, and meanwhile I hope Google et al won't count these as "bad links".
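The site-map page itself needs nothing fancy; a hypothetical fragment like this (the URLs are placeholders, not my real ones) is enough for a crawler to follow:

```html
<!-- Minimal crawlable link list; URLs are placeholders -->
<html lang="zh-Hans">
  <body>
    <ul>
      <li><a href="https://example.org/gradint-zh.html">Gradint (Chinese)</a></li>
      <li><a href="https://example.org/xuzhimo-zh.html">Xu Zhimo notes (Chinese)</a></li>
      <li><a href="https://example.org/home-zh.html">Home page (Chinese)</a></li>
    </ul>
  </body>
</html>
```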
The first request the SRCF logs showed was a GET /robots.txt from a Baidu-registered IP pretending to be Firefox 6 on Windows XP. (Note however that not all of the IP addresses Baidu uses are necessarily registered to Baidu: the IPs it had used during validation were registered to China Telecom Beijing and China Mobile.)
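Since the requesting IPs aren't all registered to Baidu, one way to check whether a "Baiduspider" request is genuine is the double reverse-DNS check that big search engines generally recommend: reverse-resolve the IP, check the hostname's domain, then forward-resolve that hostname and confirm it maps back to the same IP. A sketch, with the resolver calls injectable so the logic can be exercised without network access (the hostname pattern and the example domain list are assumptions):

```python
import socket

def is_genuine_spider(ip, domains=(".baidu.com",),
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Double reverse-DNS check: IP -> hostname -> back to the same IP."""
    try:
        host = reverse(ip)[0]          # reverse lookup
    except OSError:
        return False
    if not host.endswith(tuple(domains)):
        return False                   # hostname not under a trusted domain
    try:
        _, _, addrs = forward(host)    # forward lookup to confirm
    except OSError:
        return False
    return ip in addrs
```

The point of confirming the forward lookup is that anyone can set a reverse-DNS record claiming to be `crawl.baidu.com`, but only the real owner's forward DNS will map that hostname back to the requesting IP.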
Unfortunately I had made a mistake in that robots.txt: I had put "Disallow: /*.cgi" in it. The Robots Exclusion Standard does not define what "*" does in a Disallow line (it specifies only that "*" can be used as a universal selector in a User-agent line), so it's up to the programmers of each individual robot how to interpret "*" in a Disallow line. Baidu did not send any further requests. Had it stopped parsing at the "*" and abandoned its crawl?
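Python's standard-library parser is one concrete illustration of this ambiguity: it implements the original standard, so the "*" in a Disallow path is compared literally rather than treated as a wildcard.

```python
from urllib.robotparser import RobotFileParser

# The stdlib parser does prefix matching on the literal Disallow path,
# so "/*.cgi" matches no real URL and blocks nothing.
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /*.cgi",
])

print(rp.can_fetch("SomeBot", "http://example.org/script.cgi"))
```

Crawlers that do extend the syntax (Google documents "*" and "$" support, and Baidu reportedly recognises them too) would instead read "/*.cgi" as blocking every .cgi URL, so the same robots.txt can mean quite different things to different robots.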
Four days later the SRCF server saw two more requests, both arriving in the same second. These came from IPs registered to China Unicom Beijing and to ChinaNet Beijing. The first asked for /robots.txt in the name of Baiduspider/2.0, and the second asked for / in the name of a Samsung Galaxy S3 phone configured to the UK locale. Well, that's odd---I thought the part of Baidu that correctly identifies itself was the English-results department, developed separately from the Chinese department that sends out the old browser headers I'd seen during validation, so I didn't expect a correctly-identified request to be followed so rapidly by one that wasn't. (And it's rather unlikely to have been a real consumer browser that just happened to arrive in the same second: who on earth walks around Beijing with a UK-configured 5-year-old phone and randomly decides to load the root page of my previously unused SRCF subdomain, with no referring page, once only, coincidentally just after the robot?) So had some department of Baidu actually indexed / despite the fact that I hadn't yet fixed the dubious "*" in my robots.txt?
But then the next day saw a repeat request for robots.txt from "Firefox 6 on XP" at a Baidu-registered IP, 5 days after the first. I still hadn't fixed that "*", and Baidu (or at least that department) still didn't fetch anything else.
At that point I fixed the "*" issue; 6 days later a Baidu-registered IP again fetched robots.txt from "Firefox 6 on XP", and then after 4½ hours a ChinaNet Beijing IP once again fetched / from the "Galaxy S3"---had the robots.txt fix finally worked, or was that just a repeat request from the other department?
I have now also added the old-style content-language declaration
<meta http-equiv="content-language" content="zh-Hans">
or, for a bilingual page,
<meta http-equiv="content-language" content="zh-Hans, en">
to my Chinese-language pages "just in case", but I'm not really sure it's necessary, because Baidu had indexed my Chinese Gradint page when it had just the more-modern lang attributes. Of course I still use modern lang attributes as well---they can specify the language of individual elements and text spans, which is also good for some screen readers, and for font selection in some visual browsers, when including phrases not in the page's main language.
For the next 4 months I continued to see occasional checks from Baidu IPs (perhaps to find out how often my site changes), but my Chinese homepage still failed to appear in their search results. It's likely that Baidu update some parts of their index more frequently than others and have given me a low priority, but I was surprised they were taking this long. They did list my Chinese-English page saying Xu Zhimo's poem is not about Lin Huiyin (although giving no results for that URL in a link search), but they did not list the Chinese version of the poem page itself. It can't be that they de-list Chinese versions of foreign-hosted pages that also have English versions listed in their link hreflang markup, as that wouldn't explain why they did list my Chinese Gradint page, which has the same thing.
It wasn't making much sense.
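For reference, the link hreflang markup mentioned above is a pair of tags in each version's head cross-linking the translations so crawlers can pair them up (the URLs here are placeholders):

```html
<link rel="alternate" hreflang="en" href="https://example.org/poem.html">
<link rel="alternate" hreflang="zh-Hans" href="https://example.org/poem-zh.html">
```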
My next idea was to ask a question on Baidu's own Q&A forum, "Baidu Knows" (zhidao.baidu.com). This also required going through a sign-up process where you need a mainland China mobile number (and annoyingly you're told this only after you've gone to the trouble of typing out your Chinese question and optionally associating your QQ or Weibo account), so I asked the same Chinese friend to post on my behalf.
The question (in Chinese) basically said "why can't Baidu find these Chinese pages on a Cambridge University server" and then listed the URLs I wanted to be indexed. I wasn't expecting any serious answers (I'm told Baidu's forum has a reputation for low quality), but I hoped the URL list in the question itself might finally prompt Baidu to index them, as they are now just one hop away from Baidu's own forum (if they count Weibo's forum, surely they should count their own, I thought). Of course I would also be happy if I got a good answer, so it wasn't entirely done under false pretences, but I hope posting questions with hidden agendas like that is not in breach of some Terms and Conditions of which I'm unaware.
(I thought the 'hidden' purpose of that URL list was obvious, but my Chinese friend---also a programmer---didn't see it until I pointed it out; his initial reaction had been to ask if I'm sure the low-quality forum would be good enough for my question. So perhaps I was suffering from "hindsight bias": a plan is less obvious if you don't already know. Perhaps in the interests of 'transparency' we should have added a postscript to the question. Still, it's not as if I'm trying to cheat on a page-ranking system: I'm just trying to be in Baidu's index as opposed to not being in Baidu's index.)
I waited a month, and Baidu still hadn't indexed these pages. "Baidu Knows" didn't seem to be helping.
I'll update this page if I learn of any further developments.