
Wednesday, December 13, 2017

Building a New (.dic) Dictionary (Lexicon) for Sphinx Voice Recognition

1. create an Indonesian common-word list from http://indodic.com/IndMostComList.html and save it to "indocommonlist.txt":
ada
adalah
agar
air
akan
akibat
aku
anak
anda
antara
apa
atas
bagi
bagian
bagus
bahkan
bahwa
baik
banyak
barang
baru
beberapa
begitu
belum
benar
bersama
besar
biasa
bila
bisa
boleh
buah
bukan
cara
cepat
cukup
dalam
dan
dapat
dari
datang
dengan
depan
di
dia
dilakukan
diri
dulu
hal
hanya
harga
hari
harus
hati
hidup
hingga
ia
ingin
ini
jadi
jalan
jangan
jauh
jelas
jika
juga
jumlah
juta
kalau
kali
kami
kamu
karena
kata
ke
kecil
kecuali
kembali
kemudian
kepada
kepala
kerja
ketika
khusus
kini
kita
ku
kurang
lagi
lain
lalu
lama
langsung
luar
maka
makan
makanan
makin
malam
mampu
mana
masalah
masih
masuk
mata
mau
maupun
melakukan
melalui
memang
memberi
memberikan
membuat
memiliki
mencari
mengatakan
menjadi
menurut
merasa
mereka
merupakan
mudah
mulai
mungkin
nama
namun
nanti
oleh
orang
pada
paling
perlu
pernah
pertama
pulang
punya
pusat
saat
saja
salah
sama
sambil
sampai
sangat
saya
sebab
sebagai
sebelum
sebuah
secara
sedang
sedikit
segera
sehingga
sejak
sekali
sekalipun
sekarang
sekitar
selalu
selama
seluruh
sementara
semua
sendiri
seperti
sering
serta
sesuai
sesuatu
setelah
setiap
siap
sini
suatu
sudah
tahu
tak
tanpa
tapi
telah
tempat
tengah
tentang
terhadap
terjadi
terlalu
termasuk
tersebut
terus
tetapi
tiba
tidak
tinggi
uang
untuk
waktu
yaitu
yang
2. separate each word into its graphemes (keeping digraphs such as "ng" together) and save the result to newcommonword_lex.dic:
ada a d a
adalah a d a l a h
agar a g a r
 :
yang y a ng
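Hand-splitting a few hundred words is tedious, so a short script can generate the whole file. This is a minimal sketch (my own helper, not part of any tool used here); the digraph set is an assumption based on common Indonesian orthography and on the "yang y a ng" style of the entries above, so adjust it to match your word list.

```python
import re

# Assumed Indonesian digraphs to keep as single grapheme units;
# extend this alternation if your word list needs more.
GRAPHEME = re.compile(r"ng|ny|kh|sy|.")

def to_lexicon_entry(word):
    # "ada" -> "ada a d a", "yang" -> "yang y a ng"
    return word + " " + " ".join(GRAPHEME.findall(word))

for w in ["ada", "adalah", "yang"]:
    print(to_lexicon_entry(w))
# prints:
# ada a d a
# adalah a d a l a h
# yang y a ng
```

To build the whole file, read indocommonlist.txt line by line and write to_lexicon_entry(line.strip()) for each non-empty line into newcommonword_lex.dic.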
3. install g2p-seq2seq (https://github.com/cmusphinx/g2p-seq2seq):
$ git clone https://github.com/cmusphinx/g2p-seq2seq.git
$ sudo pip install --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.0.0-cp27-none-linux_x86_64.whl
$ cd g2p-seq2seq
$ sudo python setup.py install
4. train a new model:
$ g2p-seq2seq --train newcommonword_lex.dic --model commonwordmodeldic
wait until training finishes.

output:
Preparing G2P data
Creating vocabulary commonwordmodeldic/vocab.phoneme
Creating vocabulary commonwordmodeldic/vocab.grapheme
Reading development and training data.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE3 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
Creating model with parameters:
Learning rate:        0.5
LR decay factor:      0.99
Max gradient norm:    5.0
Batch size:           64
Size of layer:        64
Number of layers:     2
Steps per checkpoint: 200
Max steps:            0
Optimizer:            sgd

Created model with fresh parameters.
global step 200 learning rate 0.5000 step-time 0.08 perplexity 6.41
  eval: perplexity 2.20
global step 400 learning rate 0.5000 step-time 0.08 perplexity 2.10
  eval: perplexity 1.66
global step 600 learning rate 0.5000 step-time 0.08 perplexity 1.18
  eval: perplexity 1.55
global step 800 learning rate 0.5000 step-time 0.05 perplexity 1.02
  eval: perplexity 1.60
No improvement over last 1 times. Training will stop after -1iterations if no improvement was seen.
Training done.
Loading vocabularies from commonwordmodeldic
Creating 2 layers of 64 units.
Reading model parameters from commonwordmodeldic
Beginning calculation word error rate (WER) on test sample.
Words: 20
Errors: 16
WER: 0.800
Accuracy: 0.200
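The 0.800 WER looks alarming, but it is unsurprising with only ~200 training words and a 20-word held-out sample. The reported numbers relate to each other simply; this sketch assumes (inferred from the figures in the log, not from g2p-seq2seq's source) that a word counts as one error whenever any phoneme of its predicted pronunciation differs from the reference:

```python
def word_error_rate(reference, hypothesis):
    # One error per word whose predicted pronunciation differs at all.
    errors = sum(1 for ref, hyp in zip(reference, hypothesis) if ref != hyp)
    return errors / len(reference)

# The log above: 20 test words, 16 mispronounced.
# WER = 16 / 20 = 0.800, accuracy = 1 - WER = 0.200
ref = ["a d a"] * 20
hyp = ["a d a"] * 4 + ["a t a"] * 16
print(word_error_rate(ref, hyp))
# prints: 0.8
```

The main lever here is more training data: 200 lexicon entries is tiny for a seq2seq model, so expect accuracy to improve substantially with a larger word list.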

5. run the new model interactively:
$ g2p-seq2seq --interactive --model commonwordmodeldic
input >> aku
output >> a k u
input >> memakan
output >> m e m a k a n
input >> makanan
output >> m a k a n a n
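For more than a handful of words, interactive mode gets tedious; the tool also supports batch decoding through its --decode and --output flags (per the g2p-seq2seq README). A small wrapper sketch, assuming g2p-seq2seq is on your PATH; the file names here are placeholders:

```python
import subprocess

def g2p_decode_cmd(wordlist, model_dir, output):
    # Command line for batch grapheme-to-phoneme decoding.
    return ["g2p-seq2seq", "--decode", wordlist,
            "--model", model_dir, "--output", output]

def batch_g2p(wordlist, model_dir, output):
    # Runs the decoder; writes "word p h o n e m e s" lines to output.
    subprocess.check_call(g2p_decode_cmd(wordlist, model_dir, output))
```

For example, batch_g2p("newwords.txt", "commonwordmodeldic", "newwords.dic") would decode a whole word list with the model trained above.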


repository:
https://github.com/ardhimaarik/lexiconBIndo
