How to rename Unicode Chinese files to Pinyin?

I searched the web and found no existing tool for this, which is rare and a little strange. The only useful thing I found was a mapping file containing a Unicode-to-Pinyin table. So I had to do it myself and write a script that converts the Unicode Chinese file names to Pinyin using the mapping file.
Since I was doing the Python Challenge at the time, naturally I scripted it in Python.

Here is why I needed it. I have an HDTV that can play music from a USB drive. When I wanted to play the songs I had downloaded from The Voice of China, I ran into a problem: the file names contained many Unicode Chinese characters, and the TV evidently doesn’t support them. It simply doesn’t display the Chinese characters at all. For example:

01 04张玮 – High歌.mp3
05 09吉克隽逸 – I Fell Good.mp3

I can only see:
01 04 – High.mp3
05 09 – I Fell Good.mp3

Those are still readable, but the following ones are hopeless:
11 11 – .mp3
11 12 – .mp3
11 13 – .mp3
11 14 – .mp3

I had no idea which song was which when I tried to choose one. Their actual file names are:
11 11大山 – 王妃.mp3
11 12王韵壹 – 你快乐所以我快乐.mp3
11 13金池 – 后知后觉.mp3
11 14吴莫愁 – 痒.mp3

Put the mapping file and the script in one folder, put all the files to rename in a subfolder “VoC” (written 'voc' in the script), then run the script. In the end I got file names like these; not perfect, but I can tell which songs they are:
11 11 DaShan – WangFei.mp3
11 12 WangYunYi – NiKuaiLeSuoYiWoKuaiLe.mp3
11 13 JinChi – HouZhiHouJue.mp3
11 14 WuMoChou – Yang.mp3

I hope you find my solution helpful. Here is my Python script (written for Python 2.7).

# renameCH2Pinyin.py
# Rename filename from Chinese characters to capitalized pinyin using the
# mapping file and taking out the tone numbers

import os
import re

# File uni2pinyin is a mapping from hex to Pinyin with a tone number
f = open('uni2pinyin')
wf = f.read() # read the whole mapping file

os.chdir('voc') # to rename all files in sub folder 'voc'
myulist = os.listdir(u'.') # read all file names in unicode mode
for x in myulist: # each file name
    filenamePY = ''
    for y in x: # each character
        if 0x4e00 <= ord(y) <= 0x9fff: # Chinese Character Unicode range
            hexCH = (hex(ord(y))[2:]).upper() # strip leading '0x' and change
                                              # to uppercase
            p = re.compile(hexCH+r'\t([a-z]+)\d*') # define the match pattern
            mp = p.search(wf)
            filenamePY+=mp.group(1).title() if mp else y # pinyin without the tone
                                            # number, capitalized; y kept if unmapped
        else:
            filenamePY+=y
    print x
    filename = filenamePY
    print filename
    os.rename(x, filename)
os.chdir('..') # go back to the parent folder

This is the link where I got the mapping file:

ftp://ftp.cuhk.hk/pub/chinese/ifcss/software/data/Uni2Pinyin.gz
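
The script above searches the whole mapping text with a regular expression for every character. As a rough alternative, the mapping could be parsed once into a dictionary. The sketch below is written for Python 3 (unlike the Python 2.7 script above) and assumes each data line of the mapping file looks like a hex code point, a tab, then a pinyin reading with a tone number, which is what the script's regular expression implies. The names load_mapping and to_pinyin are made up for this illustration, and the sample text stands in for the real file.

```python
import re

def load_mapping(text):
    """Parse mapping text into {character: pinyin syllable without tone}."""
    mapping = {}
    for line in text.splitlines():
        if not line or line.startswith('#'):
            continue  # skip blank lines and comment lines (assumed format)
        fields = line.split('\t')
        if len(fields) < 2:
            continue  # not a "code<TAB>reading" line
        try:
            char = chr(int(fields[0], 16))  # hex code point -> character
        except ValueError:
            continue  # first field is not a hex number
        match = re.match(r'[a-z]+', fields[1])  # first reading, tone dropped
        if match:
            mapping[char] = match.group(0)
    return mapping

def to_pinyin(name, mapping):
    """Replace each mapped character with capitalized pinyin; keep the rest."""
    return ''.join(
        mapping[ch].capitalize() if ch in mapping else ch for ch in name
    )

# Tiny in-memory sample standing in for the real mapping file.
pairs = [('大', 'da4'), ('山', 'shan1'), ('王', 'wang2'), ('妃', 'fei1')]
sample = '# sample only\n' + ''.join(
    '%04X\t%s\n' % (ord(ch), reading) for ch, reading in pairs
)
mapping = load_mapping(sample)
print(to_pinyin('11 11大山 – 王妃.mp3', mapping))
```

Characters missing from the dictionary are passed through unchanged, so an incomplete mapping does not stop the conversion.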

12 thoughts on “How to rename Unicode Chinese files to Pinyin?”

    • Nope, not Python in particular, just the steps to make this script work. I have followed steps from other forums that use a script originally from your site, but it didn’t work. I have a bunch of folders and subfolders with Chinese names. I would appreciate it if you could show me how to make this script work for my situation. Thanks.

  1. Hi antoyono,
    It’s hard for me to troubleshoot your problem without detailed information about what you have done. But I can explain a little more about the script here, on top of the comments in the script itself. For the script to work, please make sure you have done the following steps:
    Assume you want to rename all files in a folder ‘C:\working\antoch2py’. Make sure you have a backup of all the files in that folder somewhere else.
    1. Install Python 2.7 for your OS (Windows or Linux) and check that it works
    2. Put my script in the ‘working’ folder
    3. Download the mapping file and unzip ‘uni2pinyin’ into the same folder as my script
    4. In line 12 of my script, change ‘voc’ to ‘antoch2py’ or ‘C:\\working\\antoch2py’
    5. Run my script in one of the following 2 ways:

    • command line:
    C:\working>python renameCH2Pinyin.py
    
    • in any python IDE:
    >>> import os
    >>> os.chdir('C:\\working')
    >>> execfile('renameCH2Pinyin.py')
    

    The script is just a demo of how to do the job; it’s not intended to cover every situation. I just wanted to share my idea. I understand it’s hard for people who don’t know Python. I will make the script more flexible when I have time. But Python is a scripting language, so at the very least you have to get Python itself working, no matter how perfect the script is.

    • Thanks for such a prompt reply. I will try it out first and get back to you in a little while. Thanks so much.

  2. Hi hxin, I tried your steps and it works. My last request: the steps you gave can only handle one folder. What if that folder has many subfolders? Instead of copying the files to the main folder each time, can you make the script work for subfolders too?
    I came across other forums that modified your script to work for subfolders. The script is below:

    # ch2Pinyin.py
    # Rename filename from Chinese characters to capitalized pinyin using the
    # mapping file and taking out the tone numbers

    import os
    import re

    def processDirectory ( args, dirname, filenames ):
    print dirname
    os.chdir(dirname) # to rename all files in sub folder %dirname
    myulist = os.listdir(u’.’) # read all file names in unicode mode

    for x in myulist: # each file name
    filenamePY = ”
    for y in x: # each character
    if 0x4e00 <= ord(y) <= 0x9fff: # Chinese Character Unicode range
    hexCH = (hex(ord(y))[2:]).upper() # strip leading '0x' and change to uppercase
    p = re.compile(hexCH+'\t([a-z]+)[\d]*') # define the match pattern
    mp = p.search(wf)
    filenamePY+=mp.group(1).title() # get the pinyin without the tone number and capitalize it
    else:
    filenamePY+=y
    print " " * 4 + x
    Nfilename = filenamePY[:9]+filenamePY[9:]
    print " " * 6 + Nfilename
    os.rename(x,Nfilename)

    # File Uni2Pinyin is a mapping from hex to Pinyin with a tone number
    f = open('Uni2Pinyin')
    wf = f.read() # read the whole mapping file

    base_dir = "/D:\Chinese to PinYin\Chinese to Pinyin Song\"
    os.path.walk( base_dir, processDirectory, None )

    All my songs are under D:\Chinese to PinYin\Chinese to Pinyin Song\. But when I run the module, it says: EOL while scanning string literal.

    Is anything wrong?

    • Try this:

      base_dir = "D:\\Chinese to PinYin\\Chinese to Pinyin Song\\"
      

      and I found a typo in this line:

      myulist = os.listdir(u’.') # read all file names in unicode mode
      

      double check the quotation mark:

      myulist = os.listdir(u'.') # read all file names in unicode mode
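
For readers wondering why the original base_dir line failed: a backslash placed right before the closing quote escapes that quote, so the string never ends and Python reports the EOL error. The short Python 3 sketch below shows three ways of writing the same Windows path. It uses the standard ntpath module only so that the example behaves the same on any operating system; the variable names are just for illustration.

```python
import ntpath  # Windows path rules, usable on any OS for illustration

# Doubling each backslash yields single backslashes in the actual string.
escaped = "D:\\Chinese to PinYin\\Chinese to Pinyin Song\\"

# A raw string avoids the doubling but cannot end in a backslash,
# so the trailing separator is left off here.
raw = r"D:\Chinese to PinYin\Chinese to Pinyin Song"

# Joining the components avoids hand-written separators altogether.
joined = ntpath.join("D:\\", "Chinese to PinYin", "Chinese to Pinyin Song")

print(escaped)
print(raw == joined == escaped.rstrip("\\"))
```

All three spell the same folder; only the first keeps the trailing separator.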
      
  3. Hi, it finally works. Thanks a lot. I just wonder whether anything can be done to put a space between each pinyin syllable. When I use search in Win 7, it can’t find the letters I type. For example, with WoAiNi.mp3, typing ‘ai’ gives no results. But with spaces between the syllables (Wo Ai Ni.mp3), the result comes up.

  4. That’s simple. Just put a space after each pinyin.
    You can modify my script line 22:

                filenamePY+=mp.group(1).title()+' '
    

    Oh, by the way, line 27 in my script doesn’t need to be that way; that was for a particular file-name situation when I first wrote the script. It should simply be:

        filename = filenamePY
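
One side effect of appending a space after every syllable is a stray space before the extension (for example "Wo Ai Ni .mp3"). The Python 3 sketch below is one possible way to avoid that; it is not part of the original script. It assumes a dictionary from Chinese character to lowercase pinyin is already available, and the function name spaced_pinyin and the three-entry sample dictionary are made up for this illustration.

```python
import os

def spaced_pinyin(name, mapping):
    """Convert mapped characters to capitalized pinyin separated by spaces."""
    stem, ext = os.path.splitext(name)  # leave the extension untouched
    out = []
    prev_mapped = False
    for ch in stem:
        if ch in mapping:
            # Separate the syllable from whatever came before it.
            if out and not out[-1].endswith(' '):
                out.append(' ')
            out.append(mapping[ch].capitalize())
            prev_mapped = True
        else:
            # Separate ordinary text from a syllable just before it.
            if prev_mapped and ch != ' ':
                out.append(' ')
            out.append(ch)
            prev_mapped = False
    return ''.join(out) + ext

# Hypothetical three-entry mapping, just enough for the demo.
demo_mapping = {'我': 'wo', '爱': 'ai', '你': 'ni'}
print(spaced_pinyin('我爱你.mp3', demo_mapping))
```

Because spaces are only added between pieces of the stem, nothing is inserted before the extension.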
    
  5. Since indentation is important in Python, I have edited the script you posted for other visitors. Also, please mention where you got the script, if you remember.

    # ch2Pinyin.py
    # Rename filename from Chinese characters to capitalized pinyin using the
    # mapping file and taking out the tone numbers.  
    
    import os
    import re
    
    def processDirectory(args, dirname, filenames):
        print dirname
        os.chdir(dirname)   # Rename all files in sub folder %dirname
        myulist = os.listdir(u'.')  # Read all file names in unicode mode
    
        for x in myulist:   # Each file name
            filenamePY = ''
            for y in x:     # Each character
                if 0x4e00 <= ord(y) <= 0x9fff:  # Chinese Character Unicode range
                    # Strip leading '0x' and change the rest to uppercase:
                    hexCH = (hex(ord(y))[2:]).upper()
                    # Define the match pattern:
                    p = re.compile(hexCH+r'\t([a-z]+)\d*')
                    mp = p.search(wf)
                    # Get the pinyin without tone number and capitalize it;
                    # keep the character as-is if the mapping has no entry.
                    filenamePY+=mp.group(1).title() if mp else y
                else:
                    filenamePY+=y
            print " " * 4 + x
            Nfilename = filenamePY
            print " " * 6 + Nfilename
            os.rename(x, Nfilename)
    
    # File Uni2Pinyin is a mapping from hex to Pinyin with a tone number.  
    f = open('Uni2Pinyin')
    wf = f.read()   # Read the whole mapping file
    
    base_dir = 'D:\\Chinese to PinYin\\Chinese to Pinyin Song\\'
    os.path.walk(base_dir, processDirectory, None)
    

    Edit:
    # Adding reference:
    # Thank you antoyono for providing the link.
    # This script is enhanced by cwtim01 from here. I may have formatted it following PEP 8.
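
One caveat with renaming inside a top-down directory walk: once a folder has been renamed, the walk may no longer find it under its old name, so its contents can be skipped. A possible workaround, sketched below for Python 3, is to walk bottom-up with os.walk(topdown=False) so each folder is renamed only after everything inside it has been processed. The function name rename_tree, the dry_run flag, and the convert callback (any function that maps an old name to a new one) are assumptions for this illustration, not part of the scripts above.

```python
import os

def rename_tree(base_dir, convert, dry_run=True):
    """Rename files and folders under base_dir bottom-up using convert(name).

    Returns the list of (old_path, new_path) pairs. With dry_run=True,
    nothing on disk is changed.
    """
    planned = []
    # topdown=False visits the deepest folders first, so a parent folder is
    # renamed only after everything inside it has been handled.
    for dirpath, dirnames, filenames in os.walk(base_dir, topdown=False):
        for name in filenames + dirnames:
            new_name = convert(name)
            if new_name == name:
                continue  # nothing to change
            src = os.path.join(dirpath, name)
            dst = os.path.join(dirpath, new_name)
            if os.path.exists(dst):
                continue  # never overwrite an existing entry
            planned.append((src, dst))
            if not dry_run:
                os.rename(src, dst)
    return planned
```

Calling rename_tree(base_dir, convert) first with the default dry_run=True gives a chance to inspect the planned pairs, and a backup of the folder is still a good idea before running it with dry_run=False.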
