
Adding Chinese web pages to Baidu (unsolved)

My University-provided personal web space has always been indexed by the globally-dominant "Google" search engine, but as a public service I thought I should also check if the few Chinese-language pages I have are indexed by China's dominant "Baidu" search engine. These notes are posted in case anyone finds them useful.

Dreadful machine translation

To begin with, I tried to find out whether Baidu had already indexed my English home page, by searching Baidu for the quoted sentence "I am a partially-sighted computer scientist at Cambridge University in the UK." The first hit (in mid-2017) was a Chinese site called zybang (short for "zuoyebang", meaning "homework help") which had evidently scraped the first three sentences from my home page, added no fewer than 25 advertising spaces (the obvious aim was to intercept search traffic for English sentences---do they try to scrape every English sentence on the web?---and earn money from advertisement impressions), and added a pretty awful machine translation: they rendered "partially sighted" as "without any future". (What? Chinese robots officially declared partially-sighted people to have no future? Is that because USTC's "robot goddess" project, which replied "no, not at all" to the question "do you like robots or humans" in a publicity stunt, has no use for the visually impaired as we can't be seduced by the bot's visual acts? I'd rather believe they are simple programs with no idea what they're saying.)

In all fairness, an alternative translation suggested further down zybang's page was "partly farsighted", as in part of me can predict the future, so I tried asking that part "when will zybang fix their translations" but got no answer. The scraper had not been clever enough to find my own translation of that sentence on the Chinese version of my home page---OK so my Chinese skills are not native, but I hope they're not as bad as that machine translation. I explained this in a Chinese email to zybang (and asked if they could at least link to my real site) but I don't know if they got it.

(Incidentally, USTC's "Jiajia" robot subsequently had a preprogrammed response that accidentally gave an English lesson instead of the scriptwriters' philosophy: "The question 'why do you think the meaning of life is' cannot be answered; 'why' is an improper word to use". Indeed: the proper word "what" had been shortened in the interviewer's speech, so mishearing it as "why" and telling her off was a semi-believable story. But I'm digressing.)

Baidu itself took second place with a similar "scrape some sentences and auto-translate" page, once again mistranslating "partially sighted" as "no future prospects" and completely missing my own translation. At this point I gave up on the idea of emailing in corrections as it was obvious I wasn't going to figure out who was actually responsible for the machine-translation technology in use.

After that the results lost relevance, and my real site was nowhere to be seen.

Baidu Mobile "English results"

The above test was conducted on Baidu's desktop version. A search for the same phrase on Baidu's mobile version (well, the first 9 words of it anyway: the input field was limited to 64 characters) did not show the zybang site, but did give Baidu's own machine translation inline (this time they rendered "partially sighted" as "lazy eye"), and also gave an offer to look at a separate page of "English results" which didn't at first include my page. But a search for "Silas Brown Cambridge" did find my page (along with an old copy of my stolen Weibo account), and a search for "I come from rural West Dorset on the South-West peninsula" put me in the second hit of the 'English results' (the Chinese results were scraped machine translations again). The following day, a repeat search for the original phrase (which had previously not found me at all) put me at the top of the 'English results' (finally)---had my previous day's tests caused it to do some reprocessing, or was it just a matter of which part of their cluster happened to be handling my requests that day?

Anyway, the desktop site was not showing this "English results" option. Or if it was, it wasn't making it obvious enough for me to find. Was that option pioneered by their mobile team and not yet fully integrated into their older desktop interface? Who knows.

Indexed via Weibo forum?

The above preliminary tests were checking what Baidu had done with my English pages, but what I really wanted them to do was to list my Chinese-language pages in their Chinese results. They weren't doing that: the Chinese equivalents of these test searches (on both desktop and mobile) weren't finding my Chinese home page at all.

However, I could find the Chinese version of my Gradint page on Baidu's Chinese results, and I could also find the English version (but not the Chinese version) of my Xu Zhimo page (which does contain some Chinese text) on Baidu's Chinese results. Why had just these pages been listed?

A Baidu search for the Gradint page's URL found a mention of it on a scraped copy of a Weibo forum, on which someone had once asked for recommendations of free websites to learn English. At that time I was still using my Weibo account, and Weibo placed this question in my inbox so I thought the questioner was asking me (I realised later that Weibo had an algorithm for placing public questions into inboxes; the human questioner had not singled me out as I'd thought). I'd replied saying I might not be the best person to ask, because as a native speaker I hadn't learned English online myself and I didn't know which sites were blocked in their country, but that they could try howjsay, Forvo, the BBC or my Gradint, and I invited the questioner to PM me to discuss their specific interests to see if I could suggest other appropriate material. (They didn't PM me, and it's too late now that my Weibo account has been stolen.) It now seemed this one mention of Gradint on a Weibo forum had been sufficient to get that page---and only that page---into Baidu's Chinese index. Similarly, a few people had linked to my Xu Zhimo page from some Chinese forum sites, and those links had to go to the English version as the Chinese version wasn't yet published.

So apparently Baidu were listing Chinese-language pages one hop away from big Chinese sites, but not two hops. They had listed my Chinese Gradint page that had been mentioned on the Weibo forum, but they had not followed the Chinese Gradint page's link to my Chinese home page. And they had listed the English version of my Xu Zhimo page (calling it a Chinese result because it does contain some Chinese text), but had not followed that page's link to the version with all-Chinese notes.

Why was my Chinese Gradint page, which Baidu had correctly identified as a Chinese-language page, not regarded as carrying as much authority as a Weibo forum for giving Baidu additional Chinese-language links, such as its link to my Chinese home page? I don't know, but I guessed it might be because Baidu has a webmaster registration scheme and they might be setting their cutoff for Chinese pages at "one hop away from registered Chinese sites". So I reasoned I should try to make my Chinese home page a "registered Chinese site", or, if this wasn't possible, at least try to get a registered Chinese site elsewhere to link to all my Chinese pages, so they're all only one hop away from a registered site.
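To make the "one hop but not two" guess concrete, here is a minimal Python sketch of a crawler that only lists pages within a fixed number of link-follows from a set of trusted starting sites. This is only an illustration of my hypothesis, not Baidu's documented behaviour; the page labels are hypothetical stand-ins for the pages discussed above, and the function name is my own.

```python
from collections import deque

def pages_within_hops(links, seeds, max_hops):
    """Return {page: hops} for every page reachable from the seeds
    in at most max_hops link-follows (breadth-first search)."""
    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        if hops[page] == max_hops:
            continue  # at the cutoff: do not follow this page's links
        for target in links.get(page, []):
            if target not in hops:
                hops[target] = hops[page] + 1
                queue.append(target)
    return hops

# Hypothetical labels, not real URLs.
links = {
    "weibo-forum": ["gradint-zh"],
    "gradint-zh": ["home-zh"],
    "chinese-forums": ["xuzhimo-en"],
    "xuzhimo-en": ["xuzhimo-zh"],
}
seeds = ["weibo-forum", "chinese-forums"]  # assumed "trusted" starting sites

listed = sorted(p for p, h in pages_within_hops(links, seeds, 1).items() if h > 0)
print(listed)  # ['gradint-zh', 'xuzhimo-en']
```

With a cutoff of one hop, the Chinese home page and the all-Chinese Xu Zhimo page are never reached, which matches what I observed. Under the same assumption, adding a registered page that links directly to every Chinese page would bring them all within the cutoff.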

Registering with Baidu

Firstly I needed a mainland China mobile number. So I asked a Chinese friend who'd collaborated on my Chinese computer-voice downloads. Presumably he's the one they'd call if there's a legal problem. I'm not planning on creating any problems, but they want an identifiable Chinese citizen on their books for every site, and it could be a heavy responsibility if you don't know what you're vouching for. So I asked someone who'd had some input into my material (and therefore a legitimate interest in registering it), and I will not accept requests to link random other sites via this registration.

Then it turned out Baidu don't accept registrations of user subdirectories on a university server: you have to own a whole domain. So I used my SRCF subdomain, which also had the benefit of giving me access to the Apache request logs, which, I hoped, could give more insight into what Baidu was doing. (I'm not allowed to read logs for my subdirectory on the DS server, but I can read them for my SRCF subdomain.)

I tried a '301-redirect' from the top of my SRCF subdomain to my Chinese home page, but Baidu's registration validator said "unknown error 301" (it pretended to be Firefox 20 on Windows 7 but hadn't been programmed to understand 301-redirects). So I placed the registration-validation file on the SRCF itself, and then the registration process worked.

But I didn't want to move my actual homepage unnecessarily, so I made a simple Chinese site-map for Baidu to crawl (I generally avoided including my English-only pages in that list, because Baidu's English results didn't seem to have the "one hop away from a registered domain" restriction, and I wanted my "registered domain" to be 'considered' Chinese if they ever made statistics on it). If Baidu was going to treat my SRCF subdomain in the same way as it treated the Weibo forum, a list of links on the SRCF should be sufficient to make it index my existing pages on the DS server, and meanwhile I hope Google et al won't count these as "bad links".

robots.txt issues?

Three days after submission, the SRCF server saw a GET /robots.txt from a Baidu-registered IP pretending to be Firefox 6 on Windows XP. (Note however that not all of the IP addresses Baidu uses are necessarily registered to Baidu: the IPs it had used during validation were registered to China Telecom Beijing and China Mobile.)
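For anyone wanting to look for similar requests in their own logs, the following is a rough Python sketch that picks out requests for /robots.txt from an Apache access log. It assumes the common "combined" log format, which may not be what your server writes; the sample lines are made up (documentation-range IP addresses and placeholder dates) and are not copied from my logs.

```python
import re

# Apache "combined" format (assumed):
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return the fields of one combined-format line, or None if it does not match."""
    match = COMBINED.match(line)
    return match.groupdict() if match else None

def robots_fetches(lines):
    """List (ip, time, user-agent) for every request for /robots.txt."""
    hits = (parse_line(line) for line in lines)
    return [(h["ip"], h["time"], h["agent"])
            for h in hits if h and h["path"] == "/robots.txt"]

# Made-up sample lines for illustration only.
sample = [
    '192.0.2.10 - - [01/Jan/2000:00:00:00 +0000] "GET /robots.txt HTTP/1.1" 200 120 "-" '
    '"Mozilla/5.0 (Windows NT 5.1; rv:6.0) Gecko/20100101 Firefox/6.0"',
    '198.51.100.7 - - [01/Jan/2000:00:00:05 +0000] "GET / HTTP/1.1" 200 512 "-" '
    '"ExampleBrowser/1.0"',
]
for ip, when, agent in robots_fetches(sample):
    print(ip, when, agent)
```

The user-agent string says nothing reliable about who sent the request, so each IP address still has to be looked up separately to see who it is registered to.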

Unfortunately I had made a mistake in that robots.txt: I had put Disallow: /*.cgi in it. The Robots Exclusion Standard does not define what * does in a Disallow line (it specifies only that * can be used as a universal selector in a User-Agent line), so it's up to the programmers of each individual robot how to interpret * in a Disallow line. Baidu did not send any further requests. Had it stopped parsing at the * and abandoned its crawl?
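The ambiguity can be shown with a small Python sketch of two ways a robot's programmers might read the same Disallow line: as a plain path prefix in which * is just an ordinary character, or with * as a wildcard (an extension that some crawlers document). Both functions are my own illustrations, the test path is hypothetical, and I do not know which reading, if either, Baidu uses.

```python
import re

def blocked_literal(rule, path):
    """Plain-prefix reading: '*' in the rule is an ordinary character."""
    return path.startswith(rule)

def blocked_wildcard(rule, path):
    """Wildcard reading: '*' in the rule matches any run of characters."""
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.match(pattern, path) is not None

rule = "/*.cgi"
path = "/example.cgi"  # hypothetical path

print(blocked_literal(rule, path))   # False: no path starts with a literal "/*.cgi"
print(blocked_wildcard(rule, path))  # True: "*" stands for "example"
```

So the same line can block every CGI script under one reading and nothing at all under the other, and a third possibility is a parser that simply gives up when it meets the *.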

Four days later the SRCF server saw two more requests, both arriving in the same second. These came from IPs registered to China Unicom Beijing and to ChinaNet Beijing. The first asked for /robots.txt in the name of Baiduspider/2.0, and the second asked for / in the name of a Samsung Galaxy S3 phone configured to the UK locale. Well, that's odd---I thought the part of Baidu that correctly identifies itself was the English-results department, developed separately from the Chinese department that sends out the old browser headers I'd seen during validation, so I didn't expect a correctly-identified request to be followed so rapidly by one that wasn't. (And it's rather unlikely to be a real consumer browser that just happened to arrive in the same second: who on earth walks around Beijing with a UK-configured 5-year-old phone and just randomly decides to load up the root page of my SRCF subdomain, which I wasn't using before, with no referring page, once only, coincidentally just after the robot?) So had some department of Baidu actually indexed / despite the fact that I hadn't yet fixed the dubious * in robots.txt?

But then the next day saw a repeat request for robots.txt from "Firefox 6 on XP" at a Baidu-registered IP, 5 days after the first. I still hadn't fixed that *, and Baidu (or at least that department) still didn't fetch anything else.

At that point I fixed the * issue; 6 days later a Baidu-registered IP again fetched robots.txt as "Firefox 6 on XP", and 4½ hours after that a ChinaNet Beijing IP once again fetched / as the "Galaxy S3"---had the robots.txt fix finally worked, or was that just a repeat request from the other department?

Language tagging

Allegedly Baidu still requires the old method of language tagging at the document level, i.e.
<meta http-equiv="content-language" content="zh-Hans">
or for a bilingual page
<meta http-equiv="content-language" content="zh-Hans, en">
and I have now done this to my Chinese-language pages "just in case", but I'm not really sure it's necessary because Baidu had indexed my Chinese Gradint page when it had just the more-modern lang attributes. Of course I still use modern lang attributes as well---they can specify the language of individual elements and text spans, which is also good for some screen readers, and for font selection in some visual browsers, when including phrases not in the page's main language.
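To check which kinds of language tagging a page actually carries, something like the following Python sketch can be used. It relies only on the standard library's html.parser; the class name, function name and sample page are my own inventions for illustration, and it reports what is present without saying anything about what Baidu requires.

```python
from html.parser import HTMLParser

class LanguageTagFinder(HTMLParser):
    """Collect document-level content-language metas and element-level lang attributes."""

    def __init__(self):
        super().__init__()
        self.meta_languages = []   # content="..." values of content-language metas
        self.lang_attributes = []  # (tag, lang) pairs for elements with a lang attribute

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        http_equiv = (attrs.get("http-equiv") or "").lower()
        if tag == "meta" and http_equiv == "content-language":
            self.meta_languages.append(attrs.get("content"))
        if attrs.get("lang"):
            self.lang_attributes.append((tag, attrs["lang"]))

def find_language_tags(html):
    finder = LanguageTagFinder()
    finder.feed(html)
    finder.close()
    return finder.meta_languages, finder.lang_attributes

# Hypothetical bilingual page using both methods of tagging.
sample = '''<html lang="zh-Hans"><head>
<meta http-equiv="content-language" content="zh-Hans, en">
</head><body><p>中文 <span lang="en">English phrase</span></p></body></html>'''

print(find_language_tags(sample))
# (['zh-Hans, en'], [('html', 'zh-Hans'), ('span', 'en')])
```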

Slow to index

For the next 8 weeks I continued to see occasional checks from Baidu IPs (perhaps to find out how often my site changes), but my Chinese homepage still failed to appear in their search results. It's likely that Baidu update some parts of their index more frequently than others and have given me a low priority, but I'm surprised they're taking this long. They did list my Chinese-English page saying Xu Zhimo's poem is not about Lin Huiyin (although giving no results for that URL in a link search), but they did not list the Chinese version of the poem page itself. It can't be that they de-list Chinese versions of foreign-hosted pages that also have English versions listed in their link hreflang markup, as that wouldn't explain why they did list my Chinese Gradint page which has the same thing.

It's not making much sense. I'll update these notes if I notice any further developments.

All material © Silas S. Brown unless otherwise stated.