Policy learning for time-bounded reachability in continuous-time Markov decision processes via doubly-stochastic gradient ascent